public final class URLNormalizers
extends java.lang.Object
There is one global scope defined by default, which consists of all active normalizers. The order in which these normalizers are executed may be defined in the "urlnormalizer.order" property, which lists implementation classes separated by spaces. If this property is missing, normalizers are run in random order. If more normalizers are activated than are named on this list, the remaining ones are run in random order after the listed ones have been executed.
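The ordering rule above can be sketched in plain Java. This is an illustration only, not the Nutch implementation: the class `NormalizerOrderSketch`, the method `executionOrder`, and the `example.*` class names are hypothetical, and skipping listed names that are not active is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the ordering rule; not part of Nutch.
final class NormalizerOrderSketch {

    /**
     * Returns the active normalizers with those named in orderProperty first
     * (in the listed order), followed by the active ones the list omits.
     */
    static List<String> executionOrder(Set<String> active, String orderProperty) {
        Set<String> remaining = new LinkedHashSet<>(active);
        List<String> ordered = new ArrayList<>();
        if (orderProperty != null && !orderProperty.trim().isEmpty()) {
            // The property value is a space-separated list of class names.
            for (String className : orderProperty.trim().split("\\s+")) {
                // Assumption: listed names that are not active are skipped.
                if (remaining.remove(className)) {
                    ordered.add(className);
                }
            }
        }
        // The documentation leaves the order of the rest unspecified;
        // this sketch simply keeps the set's iteration order.
        ordered.addAll(remaining);
        return ordered;
    }

    public static void main(String[] args) {
        Set<String> active = new LinkedHashSet<>(Arrays.asList(
            "example.NormalizerA", "example.NormalizerB", "example.NormalizerC"));
        // Prints [example.NormalizerC, example.NormalizerA, example.NormalizerB]
        System.out.println(executionOrder(active, "example.NormalizerC example.NormalizerA"));
    }
}
```

Because the unlisted normalizers run in no guaranteed order, any normalizer whose position matters should be named explicitly in the order property.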
You can define a set of contexts (or scopes) in which normalizers may be called. Each scope can have its own list of normalizers (defined in the "urlnormalizer.scope.<scope_name>" property) and its own order (defined in the "urlnormalizer.order.<scope_name>" property). If any of these properties is missing, the default settings of the global scope are used.
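A minimal sketch of the per-scope lookup with fallback, assuming `java.util.Properties` as a stand-in for the configuration object that Nutch actually uses. The class `ScopeConfigSketch`, its methods, the scope names `inject` and `fetcher`, and the `example.*` class names are hypothetical; only the property name patterns come from the text above.

```java
import java.util.Properties;

// Hypothetical illustration of scope-specific settings; not part of Nutch.
final class ScopeConfigSketch {

    /** Scope-specific order if present, otherwise the global "urlnormalizer.order". */
    static String orderFor(Properties conf, String scope) {
        String globalOrder = conf.getProperty("urlnormalizer.order");
        return conf.getProperty("urlnormalizer.order." + scope, globalOrder);
    }

    /**
     * Scope-specific normalizer list, or null when the scope defines none
     * and the global set of active normalizers should be used instead.
     */
    static String normalizersFor(Properties conf, String scope) {
        return conf.getProperty("urlnormalizer.scope." + scope);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("urlnormalizer.order", "example.NormalizerA example.NormalizerB");
        conf.setProperty("urlnormalizer.order.inject", "example.NormalizerB example.NormalizerA");

        System.out.println(orderFor(conf, "inject"));   // scope-specific order
        System.out.println(orderFor(conf, "fetcher"));  // falls back to the global order
    }
}
```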
If no normalizers are required for a given scope, org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer should be used.
Each normalizer may further select among many configurations, depending on the scope in which it is called, because the scope name is passed as a parameter to each normalizer. You can also use the same normalizer for many scopes.
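As a rough sketch of how a single normalizer might choose its behaviour from the scope name it receives, consider the following. The class `ScopeAwareNormalizerSketch`, the `register` method, and the scope names are hypothetical and do not reflect any actual Nutch plugin.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical illustration of a scope-aware normalizer; not part of Nutch.
final class ScopeAwareNormalizerSketch {
    private final Map<String, UnaryOperator<String>> rulesByScope = new HashMap<>();
    private final UnaryOperator<String> defaultRule;

    ScopeAwareNormalizerSketch(UnaryOperator<String> defaultRule) {
        this.defaultRule = defaultRule;
    }

    /** Registers the rule to apply when normalize is called with this scope. */
    void register(String scope, UnaryOperator<String> rule) {
        rulesByScope.put(scope, rule);
    }

    /** Applies the rule registered for the scope, or the default rule if none exists. */
    String normalize(String urlString, String scope) {
        return rulesByScope.getOrDefault(scope, defaultRule).apply(urlString);
    }

    public static void main(String[] args) {
        ScopeAwareNormalizerSketch normalizer =
            new ScopeAwareNormalizerSketch(UnaryOperator.identity());
        normalizer.register("inject", String::trim);

        System.out.println(normalizer.normalize("  http://example.com/  ", "inject"));
        System.out.println(normalizer.normalize("http://example.com/", "fetcher"));
    }
}
```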
Several scopes have been defined, and various Nutch tools attempt to use scope-specific normalizers first, falling back to the default configuration if the scope-specific configuration is missing.
Normalizers may be run several times, to ensure that modifications introduced by normalizers at the end of the list can be further reduced by normalizers executed at the beginning. By default this loop is executed just once. If you want to ensure that all possible combinations have been applied, you may want to run this loop up to as many times as there are activated normalizers. The loop count can be configured through the urlnormalizer.loop.count property. As soon as a pass leaves the URL unchanged, the loop stops and the result is returned.
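The loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Nutch source: the class `NormalizerLoopSketch` is hypothetical, and normalizers are modelled as plain `UnaryOperator<String>` functions rather than Nutch plugin instances.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical illustration of the normalization loop; not part of Nutch.
final class NormalizerLoopSketch {

    /** Runs all normalizers in order, up to loopCount passes, stopping early once a pass changes nothing. */
    static String normalize(String url, List<UnaryOperator<String>> normalizers, int loopCount) {
        String current = url;
        for (int pass = 0; pass < loopCount; pass++) {
            String before = current;
            for (UnaryOperator<String> normalizer : normalizers) {
                current = normalizer.apply(current);
            }
            if (current.equals(before)) {
                break; // the URL is stable, so further passes cannot change it
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> normalizers = new ArrayList<>();
        normalizers.add(s -> s.replace("/./", "/"));  // runs first
        normalizers.add(s -> s.replace("%2E", "."));  // runs second, may create new "/./"

        String url = "http://example.com/a/%2E/b";
        System.out.println(normalize(url, normalizers, 1)); // http://example.com/a/./b
        System.out.println(normalize(url, normalizers, 2)); // http://example.com/a/b
    }
}
```

In this example the second normalizer produces a pattern that only the first one can reduce, so a single pass leaves the URL partially normalized and a second pass finishes the job.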
Modifier and Type | Field and Description |
---|---|
static java.lang.String | SCOPE_CRAWLDB: Scope used when updating the CrawlDb with new URLs. |
static java.lang.String | SCOPE_DEFAULT: Default scope. |
static java.lang.String | SCOPE_FETCHER: Scope used by FetcherJob when processing redirect URLs. |
static java.lang.String | SCOPE_GENERATE_HOST_COUNT: Scope used by GeneratorJob. |
static java.lang.String | SCOPE_INJECT: Scope used by InjectorJob. |
static java.lang.String | SCOPE_LINKDB: Scope used when updating the LinkDb with new URLs. |
static java.lang.String | SCOPE_OUTLINK: Scope used when constructing new Outlink instances. |
static java.lang.String | SCOPE_PARTITION: Scope used by URLPartitioner. |
Constructor and Description |
---|
URLNormalizers(Configuration conf, java.lang.String scope) |
Modifier and Type | Method and Description |
---|---|
java.lang.String | normalize(java.lang.String urlString, java.lang.String scope): Normalize |
public static final java.lang.String SCOPE_DEFAULT
Default scope.
public static final java.lang.String SCOPE_PARTITION
Scope used by URLPartitioner.
public static final java.lang.String SCOPE_GENERATE_HOST_COUNT
Scope used by GeneratorJob.
public static final java.lang.String SCOPE_FETCHER
Scope used by FetcherJob when processing redirect URLs.
public static final java.lang.String SCOPE_CRAWLDB
Scope used when updating the CrawlDb with new URLs.
public static final java.lang.String SCOPE_LINKDB
Scope used when updating the LinkDb with new URLs.
public static final java.lang.String SCOPE_INJECT
Scope used by InjectorJob.
public static final java.lang.String SCOPE_OUTLINK
Scope used when constructing new Outlink instances.
public URLNormalizers(Configuration conf, java.lang.String scope)
public java.lang.String normalize(java.lang.String urlString, java.lang.String scope) throws java.net.MalformedURLException
Parameters:
urlString - The URL string to normalize.
scope - The given scope.
Returns:
The normalized URL string, using the given scope.
Throws:
java.net.MalformedURLException - If the given URL string is malformed.
Copyright © 2019 The Apache Software Foundation