Package | Description |
---|---|
org.apache.nutch.analysis.lang |
Text document language identifier.
|
org.apache.nutch.indexer |
Index content, configure and run indexing and cleaning jobs to
add, update, and delete documents from an index.
|
org.apache.nutch.indexer.anchor |
An indexing plugin for inbound anchor text.
|
org.apache.nutch.indexer.basic |
A basic indexing plugin that adds basic fields: url, host, title, content, etc.
|
org.apache.nutch.indexer.html |
Index raw HTML content.
|
org.apache.nutch.indexer.jsoup.extractor |
Indexing filter for the jsoup-extractor plugin.
|
org.apache.nutch.indexer.metadata |
Indexing filter to add document metadata to the index.
|
org.apache.nutch.indexer.more |
An indexing plugin that adds "more" index fields:
last modified date, MIME type, content length.
|
org.apache.nutch.indexer.subcollection |
Indexing filter to assign documents to subcollections.
|
org.apache.nutch.indexer.tld |
Top Level Domain Indexing plugin.
|
org.apache.nutch.microformats.reltag |
A microformats Rel-Tag
Parser/Indexer/Querier plugin.
|
org.creativecommons.nutch |
Sample plugins that parse and index Creative Commons metadata.
|
Modifier and Type | Method and Description |
---|---|
NutchDocument |
LanguageIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
IndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page)
Adds fields or otherwise modifies the document that will be indexed for a
parse.
|
NutchDocument |
IndexingFilters.filter(NutchDocument doc,
java.lang.String url,
WebPage page)
Run all defined filters.
|
boolean |
IndexCleaningFilters.remove(java.lang.String url,
WebPage page)
Run all defined filters.
|
boolean |
IndexCleaningFilter.remove(java.lang.String url,
WebPage page) |
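The "run all defined filters" behaviour listed above can be pictured as a simple chain. The following is a minimal sketch, not the Nutch implementation: `SketchDocument`, `SketchPage`, `SketchIndexingFilter`, and `FilterChainSketch` are hypothetical stand-ins for the real types so that the example compiles on its own, and the rule that a `null` result discards the document is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for NutchDocument: a multi-valued field map.
class SketchDocument {
    final Map<String, List<String>> fields = new LinkedHashMap<>();

    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }
}

// Hypothetical stand-in for WebPage, reduced to a single property.
class SketchPage {
    final String title;

    SketchPage(String title) {
        this.title = title;
    }
}

// Mirrors the shape of the filter(doc, url, page) signature listed above.
interface SketchIndexingFilter {
    SketchDocument filter(SketchDocument doc, String url, SketchPage page);
}

class FilterChainSketch {
    // Runs every filter in order. Assumption: a filter that returns null
    // asks for the document to be skipped, so the chain stops early.
    static SketchDocument runAll(List<SketchIndexingFilter> filters,
                                 SketchDocument doc, String url, SketchPage page) {
        for (SketchIndexingFilter f : filters) {
            doc = f.filter(doc, url, page);
            if (doc == null) {
                return null;
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        SketchIndexingFilter addUrl = (doc, url, page) -> {
            doc.add("url", url);
            return doc;
        };
        SketchIndexingFilter addTitle = (doc, url, page) -> {
            doc.add("title", page.title);
            return doc;
        };
        SketchDocument result = runAll(Arrays.asList(addUrl, addTitle),
                new SketchDocument(), "http://example.org/", new SketchPage("Example"));
        System.out.println(result.fields);
    }
}
```

Each filter receives the document produced by the previous one, which is why the order in which filters are configured can matter.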
Modifier and Type | Method and Description |
---|---|
NutchDocument |
AnchorIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page)
Adds inbound anchor text to the document. Boolean configuration
settings control the deduplication of anchors. |
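To illustrate what a boolean deduplication setting might do, here is a minimal sketch. It is not taken from `AnchorIndexingFilter`: the class `AnchorDedupSketch`, the method `collectAnchors`, and the choice to compare anchors case-insensitively are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnchorDedupSketch {
    // Returns anchors in their original order. When deduplicate is true,
    // later anchors that differ only by letter case are dropped
    // (case-insensitive matching is an assumption of this sketch).
    static List<String> collectAnchors(List<String> inboundAnchors, boolean deduplicate) {
        List<String> kept = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String anchor : inboundAnchors) {
            if (deduplicate && !seen.add(anchor.toLowerCase(Locale.ROOT))) {
                continue; // an equivalent anchor was already kept
            }
            kept.add(anchor);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> anchors = Arrays.asList("Home", "home", "About");
        System.out.println(collectAnchors(anchors, true));  // [Home, About]
        System.out.println(collectAnchors(anchors, false)); // [Home, home, About]
    }
}
```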
Modifier and Type | Method and Description |
---|---|
NutchDocument |
BasicIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page)
Adds the basic fields to the document. A configurable value limits the
number of characters permitted within the title; see
indexer.max.title.length in nutch-default.xml. |
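A title length limit of the kind controlled by `indexer.max.title.length` could be applied as in the sketch below. The class `TitleLengthSketch` and the method `limitTitle` are hypothetical, and treating a negative limit as "no limit" is an assumption rather than documented behaviour.

```java
class TitleLengthSketch {
    // Truncates the title to at most maxTitleLength characters.
    // Assumption: a negative limit means the title is left untouched.
    static String limitTitle(String title, int maxTitleLength) {
        if (maxTitleLength >= 0 && title.length() > maxTitleLength) {
            return title.substring(0, maxTitleLength);
        }
        return title;
    }

    public static void main(String[] args) {
        System.out.println(limitTitle("Apache Nutch", 6));  // Apache
        System.out.println(limitTitle("Apache Nutch", -1)); // Apache Nutch
    }
}
```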
Modifier and Type | Method and Description |
---|---|
NutchDocument |
HtmlIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
JsoupIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
MetadataIndexer.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
MoreIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
SubcollectionIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
TLDIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
RelTagIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page)
The
RelTagIndexingFilter filter object. |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
CCIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Copyright © 2019 The Apache Software Foundation