public class AnchorIndexingFilter extends java.lang.Object implements IndexingFilter
See anchorIndexingFilter.deduplicate in nutch-default.xml.
Fields inherited from interface IndexingFilter: X_POINT_ID
Constructor and Description |
---|
AnchorIndexingFilter() |
Modifier and Type | Method and Description |
---|---|
void | addIndexBackendOptions(Configuration conf) |
NutchDocument | filter(NutchDocument doc, java.lang.String url, WebPage page): The AnchorIndexingFilter filter object, which supports boolean configuration settings for the deduplication of anchors. |
Configuration | getConf(): Get the Configuration object. |
java.util.Collection<WebPage.Field> | getFields(): Gets all the fields for a given WebPage. Many datastores need to set up the mapreduce job by specifying the fields needed. |
void | setConf(Configuration conf): Set the Configuration object. |
public void setConf(Configuration conf)
Set the Configuration object.
Specified by: setConf in interface Configurable
public Configuration getConf()
Get the Configuration object.
Specified by: getConf in interface Configurable
public void addIndexBackendOptions(Configuration conf)
public NutchDocument filter(NutchDocument doc, java.lang.String url, WebPage page) throws IndexingException
The AnchorIndexingFilter filter object, which supports boolean configuration settings for the deduplication of anchors. See anchorIndexingFilter.deduplicate in nutch-default.xml.
Specified by: filter in interface IndexingFilter
Parameters:
doc - The NutchDocument object
url - URL to be filtered for anchor text
page - WebPage object relative to the URL
Throws:
IndexingException
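The effect of a boolean deduplication setting on inbound anchor text can be sketched without any Nutch classes. The example below is an illustration only, not the Nutch implementation: the class AnchorDedupSketch, the method selectAnchors, and the choice to compare anchors case-insensitively are all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for illustration; not part of Nutch.
class AnchorDedupSketch {

    // Returns the anchors that would be added to the index document.
    // deduplicate == false: every non-null anchor is kept.
    // deduplicate == true: only the first occurrence of each anchor is kept,
    // compared case-insensitively (an assumption of this sketch).
    static List<String> selectAnchors(List<String> inboundAnchors, boolean deduplicate) {
        List<String> kept = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String anchor : inboundAnchors) {
            if (anchor == null) {
                continue; // skip inlinks that carry no anchor text
            }
            if (deduplicate && !seen.add(anchor.toLowerCase(Locale.ROOT))) {
                continue; // an equivalent anchor was already kept
            }
            kept.add(anchor);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> anchors = Arrays.asList("Apache Nutch", "apache nutch", "Crawler");
        System.out.println(selectAnchors(anchors, true));  // [Apache Nutch, Crawler]
        System.out.println(selectAnchors(anchors, false)); // [Apache Nutch, apache nutch, Crawler]
    }
}
```

With deduplication on, repeated anchor text contributes only once to the indexed field; with it off, every inbound anchor is indexed, including repeats.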
public java.util.Collection<WebPage.Field> getFields()
Gets all the fields for a given WebPage. Many datastores need to set up the mapreduce job by specifying the fields needed. All extensions that work on WebPage are able to specify what fields they need.
Specified by: getFields in interface FieldPluggable
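The idea that each extension declares the fields it needs, so that a job can request only those from the datastore, can be sketched as a simple union over the declared collections. Everything named below is hypothetical: the Field enum is a made-up subset rather than the real WebPage.Field, FieldAware stands in for an extension exposing getFields(), and the assignment of INLINKS to an anchor-oriented filter is an assumption for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for illustration; not part of Nutch.
class FieldCollectionSketch {

    // Made-up subset of page fields; the real WebPage.Field enum is not reproduced here.
    enum Field { CONTENT, TITLE, INLINKS }

    // Stand-in for an extension that can report which fields it reads.
    interface FieldAware {
        Collection<Field> getFields();
    }

    // Unions the fields declared by every extension, so job setup can
    // ask the datastore for exactly those fields and nothing more.
    static Set<Field> requiredFields(List<FieldAware> extensions) {
        Set<Field> all = EnumSet.noneOf(Field.class);
        for (FieldAware extension : extensions) {
            all.addAll(extension.getFields());
        }
        return all;
    }

    public static void main(String[] args) {
        FieldAware anchorFilter = () -> EnumSet.of(Field.INLINKS);             // assumed field
        FieldAware basicFilter = () -> EnumSet.of(Field.TITLE, Field.CONTENT); // assumed fields
        System.out.println(requiredFields(Arrays.asList(anchorFilter, basicFilter)));
        // [CONTENT, TITLE, INLINKS]
    }
}
```

Using an EnumSet keeps the union free of duplicates when several extensions declare the same field.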
Copyright © 2019 The Apache Software Foundation