public class RegexURLNormalizer extends Configured implements URLNormalizer
This class reads the urlnormalizer.regex.file property, which should be set to the name of an XML file containing the patterns and substitutions to apply to encountered URLs.
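As a rough illustration of what such a rules file might hold, the sketch below parses a small XML document into pattern/substitution pairs using only the JDK. The element names (`regex-normalize`, `regex`, `pattern`, `substitution`), the sample rules, and the `RuleFileSketch` and `Rule` names are assumptions made for this example; they are not confirmed by this page and this is not the class's actual loader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

final class RuleFileSketch {

    /** Hypothetical stand-in for one pattern/substitution pair. */
    static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(Pattern pattern, String substitution) {
            this.pattern = pattern;
            this.substitution = substitution;
        }
    }

    // Assumed file layout: the element names are illustrative, not taken from this page.
    static final String SAMPLE_XML =
        "<regex-normalize>"
      + "<regex><pattern>;jsessionid=[0-9A-Za-z]+</pattern><substitution></substitution></regex>"
      + "<regex><pattern>#.*$</pattern><substitution></substitution></regex>"
      + "</regex-normalize>";

    /** Parses the XML text and compiles each pattern, keeping document order. */
    static List<Rule> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList regexes = doc.getDocumentElement().getElementsByTagName("regex");
            List<Rule> rules = new ArrayList<>();
            for (int i = 0; i < regexes.getLength(); i++) {
                Element regex = (Element) regexes.item(i);
                // Pattern.compile throws PatternSyntaxException on a malformed pattern.
                rules.add(new Rule(Pattern.compile(childText(regex, "pattern")),
                                   childText(regex, "substitution")));
            }
            return rules;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not read rules XML", e);
        }
    }

    /** Returns the text of the first child element with the given tag, or "" if absent. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) {
        for (Rule rule : parse(SAMPLE_XML)) {
            System.out.println(rule.pattern.pattern() + " -> \"" + rule.substitution + "\"");
        }
    }
}
```

Compiling the patterns at load time means a malformed rule fails early, which is consistent with the constructors on this page declaring PatternSyntaxException.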
This class also supports different rules depending on the scope. Please see the javadoc in URLNormalizers for more details.
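One way to picture scope-dependent rules is a map from scope name to rule list, with a fallback when a scope has no rules of its own. The sketch below is a guess at that idea only: the fallback behaviour, the `"default"` scope name, the example scope names, and the `ScopedRulesSketch` class are assumptions for illustration, and rules are represented as plain strings to keep the example short.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ScopedRulesSketch {

    // Assumed name of the fallback scope; not confirmed by this page.
    static final String DEFAULT_SCOPE = "default";

    /**
     * Returns the rules registered for the given scope, or the default
     * rules (possibly an empty list) when the scope has none of its own.
     */
    static List<String> rulesFor(Map<String, List<String>> scopedRules, String scope) {
        List<String> rules = scopedRules.get(scope);
        if (rules != null) {
            return rules;
        }
        return scopedRules.getOrDefault(DEFAULT_SCOPE, Collections.<String>emptyList());
    }

    public static void main(String[] args) {
        Map<String, List<String>> scopedRules = new HashMap<>();
        scopedRules.put(DEFAULT_SCOPE, Arrays.asList("strip-session-id"));
        scopedRules.put("fetcher", Arrays.asList("strip-session-id", "strip-fragment"));

        System.out.println(rulesFor(scopedRules, "fetcher")); // scope-specific rules
        System.out.println(rulesFor(scopedRules, "inject"));  // falls back to default
    }
}
```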
Constructor and Description |
---|
RegexURLNormalizer() The default constructor, which is called from UrlNormalizerFactory (normalizerClass.newInstance()) in method getNormalizer(). |
RegexURLNormalizer(Configuration conf) |
RegexURLNormalizer(Configuration conf, java.lang.String filename) Constructor which can be passed the file name, so it doesn't look in the configuration files for it. |
Modifier and Type | Method and Description |
---|---|
java.util.HashMap<java.lang.String,java.util.List<org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.Rule>> | getScopedRules() |
static void | main(java.lang.String[] args) Prints the patterns and substitutions that are in the configuration file. |
java.lang.String | normalize(java.lang.String urlString, java.lang.String scope) |
java.lang.String | regexNormalize(java.lang.String urlString, java.lang.String scope) This function does the replacements by iterating through all the regex patterns. |
void | setConf(Configuration conf) |
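Methods inherited from class Configured: getConf
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Configurable: getConf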
public RegexURLNormalizer()
public RegexURLNormalizer(Configuration conf)
public RegexURLNormalizer(Configuration conf, java.lang.String filename) throws java.io.IOException, java.util.regex.PatternSyntaxException
Throws:
java.io.IOException
java.util.regex.PatternSyntaxException
public java.util.HashMap<java.lang.String,java.util.List<org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.Rule>> getScopedRules()
public void setConf(Configuration conf)
Specified by: setConf in interface Configurable
Overrides: setConf in class Configured
public java.lang.String regexNormalize(java.lang.String urlString, java.lang.String scope)
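The summary for regexNormalize says it performs the replacements by iterating through all the regex patterns. A minimal sketch of that idea, using only java.util.regex, might look like the following. The `RegexNormalizeSketch` and `Rule` names, the sample rules, and the use of `replaceAll` for each rule are assumptions for illustration, not the class's confirmed implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class RegexNormalizeSketch {

    /** Hypothetical pattern/substitution pair; not the class's real Rule type. */
    static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(String regex, String substitution) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    /** Illustrative rules: drop a session id, then drop the fragment. */
    static List<Rule> sampleRules() {
        return Arrays.asList(
            new Rule(";jsessionid=[0-9A-Za-z]+", ""),
            new Rule("#.*$", ""));
    }

    /**
     * Applies every rule in list order. Each rule operates on the output
     * of the previous one, so the order of the rules can matter.
     */
    static String normalize(String url, List<Rule> rules) {
        String result = url;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.substitution);
        }
        return result;
    }

    public static void main(String[] args) {
        String url = "http://example.com/a;jsessionid=ABC123?x=1#top";
        System.out.println(normalize(url, sampleRules()));
    }
}
```

Because the substitution string is passed to `replaceAll`, group references such as `$1` are interpreted, so a literal dollar sign or backslash in a substitution would need escaping.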
public java.lang.String normalize(java.lang.String urlString, java.lang.String scope) throws java.net.MalformedURLException
Specified by: normalize in interface URLNormalizer
Throws: java.net.MalformedURLException
public static void main(java.lang.String[] args) throws java.util.regex.PatternSyntaxException, java.io.IOException
Throws:
java.util.regex.PatternSyntaxException
java.io.IOException
Copyright © 2019 The Apache Software Foundation