Package | Description |
---|---|
org.apache.nutch.analysis.lang |
Text document language identifier.
|
org.apache.nutch.indexer |
Index content, configure and run indexing and cleaning jobs to
add, update, and delete documents from an index.
|
org.apache.nutch.indexer.anchor |
An indexing plugin for inbound anchor text.
|
org.apache.nutch.indexer.basic |
A basic indexing plugin that adds basic fields: url, host, title, content, etc.
|
org.apache.nutch.indexer.html |
Index raw HTML content.
|
org.apache.nutch.indexer.jsoup.extractor |
Indexing filter for the jsoup-extractor plugin.
|
org.apache.nutch.indexer.metadata |
Indexing filter to add document metadata to the index.
|
org.apache.nutch.indexer.more |
An indexing plugin that adds "more" index fields:
last modified date, MIME type, content length.
|
org.apache.nutch.indexer.subcollection |
Indexing filter to assign documents to subcollections.
|
org.apache.nutch.indexer.tld |
Top Level Domain Indexing plugin.
|
org.apache.nutch.indexwriter.elastic2 |
Index writer plugin for Elasticsearch.
|
org.apache.nutch.indexwriter.hbase |
Index writer plugin for Apache HBase.
|
org.apache.nutch.indexwriter.solr |
Index writer plugin for Apache Solr.
|
org.apache.nutch.microformats.reltag |
A microformats Rel-Tag parser/indexer/querier plugin.
|
org.apache.nutch.scoring |
The ScoringFilter interface.
|
org.apache.nutch.scoring.link |
Link analysis scoring filter.
|
org.apache.nutch.scoring.opic |
Scoring filter implementing a variant of the Online Page Importance Computation
(OPIC) algorithm.
|
org.apache.nutch.scoring.tld |
Top Level Domain Scoring plugin.
|
org.creativecommons.nutch |
Sample plugins that parse and index Creative Commons metadata.
|
Modifier and Type | Method and Description |
---|---|
NutchDocument |
LanguageIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
LanguageIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
IndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page)
Adds fields or otherwise modifies the document that will be indexed for a
parse.
|
NutchDocument |
IndexingFilters.filter(NutchDocument doc,
java.lang.String url,
WebPage page)
Run all defined filters.
|
NutchDocument |
IndexUtil.index(java.lang.String key,
WebPage page)
Index a WebPage. The following fields are added:
id: the default uniqueKey for the NutchDocument.
digest: identifies pages (like a unique ID) and is used to remove
duplicates during the dedup procedure. |
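
The "Run all defined filters" behaviour of IndexingFilters.filter can be pictured with a small sketch. Everything below is a hypothetical stand-in, not the Nutch API: SketchDocument, SketchIndexingFilter, SketchIndexingFilters and HostFieldFilter are invented for illustration, the WebPage parameter is left out, and treating a null return as "drop the document" is an assumption about how a chain might behave.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for NutchDocument: a multi-valued field map.
class SketchDocument {
    final Map<String, List<String>> fields = new LinkedHashMap<>();

    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }
}

// Hypothetical stand-in for IndexingFilter (the WebPage argument is omitted).
interface SketchIndexingFilter {
    // Returns the (possibly modified) document, or null to discard it (assumption).
    SketchDocument filter(SketchDocument doc, String url);
}

// Example filter: adds a "host" field derived from the URL.
class HostFieldFilter implements SketchIndexingFilter {
    @Override
    public SketchDocument filter(SketchDocument doc, String url) {
        String host = URI.create(url).getHost();
        if (host != null) {
            doc.add("host", host);
        }
        return doc;
    }
}

// Hypothetical stand-in for IndexingFilters: applies each filter in order.
class SketchIndexingFilters {
    private final List<SketchIndexingFilter> filters;

    SketchIndexingFilters(List<SketchIndexingFilter> filters) {
        this.filters = filters;
    }

    SketchDocument filter(SketchDocument doc, String url) {
        for (SketchIndexingFilter f : filters) {
            doc = f.filter(doc, url);
            if (doc == null) {
                return null; // a filter discarded the document; stop the chain
            }
        }
        return doc;
    }
}
```

Each filter receives the document produced by the previous one, so field additions accumulate along the chain.
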
Modifier and Type | Method and Description |
---|---|
RecordWriter<java.lang.String,NutchDocument> |
IndexerOutputFormat.getRecordWriter(TaskAttemptContext job) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
IndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page)
Adds fields or otherwise modifies the document that will be indexed for a
parse.
|
NutchDocument |
IndexingFilters.filter(NutchDocument doc,
java.lang.String url,
WebPage page)
Run all defined filters.
|
void |
IndexWriter.update(NutchDocument doc) |
void |
IndexWriters.update(NutchDocument doc) |
void |
IndexWriter.write(NutchDocument doc) |
void |
IndexWriters.write(NutchDocument doc) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
AnchorIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page)
The
AnchorIndexingFilter filter object which supports boolean
configuration settings for the deduplication of anchors. |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
AnchorIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page)
The
AnchorIndexingFilter filter object which supports boolean
configuration settings for the deduplication of anchors. |
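
As a rough illustration of what a boolean anchor-deduplication setting might control, the sketch below collects inbound anchor texts with and without deduplication. The class and method names are hypothetical, and comparing anchors case-insensitively is an assumption rather than documented behaviour.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnchorDedupSketch {
    // Returns the anchors to index; when deduplicate is true, anchors that
    // differ only by case are kept once (first occurrence wins, assumption).
    static List<String> collectAnchors(List<String> inboundAnchors, boolean deduplicate) {
        List<String> result = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String anchor : inboundAnchors) {
            if (deduplicate) {
                String key = anchor.toLowerCase(Locale.ROOT);
                if (!seen.add(key)) {
                    continue; // already indexed an equivalent anchor
                }
            }
            result.add(anchor);
        }
        return result;
    }
}
```

With deduplication off, every inbound anchor is kept, so popular link texts appear many times in the anchor field.
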
Modifier and Type | Method and Description |
---|---|
NutchDocument |
BasicIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page)
The BasicIndexingFilter filter object, which supports a configurable limit on
the number of characters permitted within the title; see
indexer.max.title.length in nutch-default.xml. |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
BasicIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page)
The BasicIndexingFilter filter object, which supports a configurable limit on
the number of characters permitted within the title; see
indexer.max.title.length in nutch-default.xml. |
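
The title-length limit mentioned above could be applied roughly as follows. This is a minimal sketch: TitleLengthSketch and limitTitle are hypothetical names, and treating a negative limit as "no limit" is an assumption.

```java
class TitleLengthSketch {
    // Truncates the title to at most maxTitleLength characters.
    // A negative maxTitleLength is treated as "no limit" (assumption).
    static String limitTitle(String title, int maxTitleLength) {
        if (maxTitleLength > -1 && title.length() > maxTitleLength) {
            return title.substring(0, maxTitleLength);
        }
        return title;
    }
}
```

Note that String.length and substring count UTF-16 code units, so a limit applied this way can split a surrogate pair in titles containing supplementary characters.
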
Modifier and Type | Method and Description |
---|---|
NutchDocument |
HtmlIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
HtmlIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
JsoupIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
JsoupIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
MetadataIndexer.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
MetadataIndexer.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
MoreIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
MoreIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
SubcollectionIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
SubcollectionIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
TLDIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
TLDIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
void |
ElasticIndexWriter.update(NutchDocument doc) |
void |
ElasticIndexWriter.write(NutchDocument doc) |
Modifier and Type | Method and Description |
---|---|
void |
HBaseIndexWriter.update(NutchDocument doc) |
void |
HBaseIndexWriter.write(NutchDocument doc) |
Modifier and Type | Method and Description |
---|---|
void |
SolrIndexWriter.update(NutchDocument doc) |
void |
SolrIndexWriter.write(NutchDocument doc) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
RelTagIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page)
The
RelTagIndexingFilter filter object. |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
RelTagIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page)
The
RelTagIndexingFilter filter object. |
Modifier and Type | Method and Description |
---|---|
float |
ScoringFilter.indexerScore(java.lang.String url,
NutchDocument doc,
WebPage page,
float initScore)
This method calculates a Lucene document boost.
|
float |
ScoringFilters.indexerScore(java.lang.String url,
NutchDocument doc,
WebPage row,
float initScore) |
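
One plausible way for a ScoringFilters-style wrapper to combine several scoring filters is to feed each filter's result into the next as its initScore. The sketch below assumes that chaining; BoostFilterSketch and BoostChainSketch are hypothetical names, and the NutchDocument and WebPage arguments are omitted.

```java
import java.util.List;

// Hypothetical stand-in for ScoringFilter.indexerScore (doc and page omitted).
interface BoostFilterSketch {
    float indexerScore(String url, float initScore);
}

class BoostChainSketch {
    private final List<BoostFilterSketch> filters;

    BoostChainSketch(List<BoostFilterSketch> filters) {
        this.filters = filters;
    }

    // Passes the running boost through every filter in order (assumption).
    float indexerScore(String url, float initScore) {
        float score = initScore;
        for (BoostFilterSketch f : filters) {
            score = f.indexerScore(url, score);
        }
        return score;
    }
}
```

Under this design the order of the filters matters whenever they are not all purely multiplicative.
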
Modifier and Type | Method and Description |
---|---|
float |
LinkAnalysisScoringFilter.indexerScore(java.lang.String url,
NutchDocument doc,
WebPage page,
float initScore) |
Modifier and Type | Method and Description |
---|---|
float |
OPICScoringFilter.indexerScore(java.lang.String url,
NutchDocument doc,
WebPage row,
float initScore)
Dampen the boost value by scorePower.
|
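
"Dampen the boost value by scorePower" can be read as raising the page score to a power below 1 before applying it to the incoming boost. The sketch below assumes that reading; OpicBoostSketch, dampenedBoost and the pageScore parameter are hypothetical names, not taken from the plugin.

```java
class OpicBoostSketch {
    // Assumed formula: boost = pageScore^scorePower * initScore.
    // With 0 < scorePower < 1, large page scores are compressed.
    static float dampenedBoost(float pageScore, float scorePower, float initScore) {
        return (float) Math.pow(pageScore, scorePower) * initScore;
    }
}
```

For example, with scorePower = 0.5 a page score of 100 contributes a factor of 10 rather than 100, which keeps a few very highly scored pages from dominating ranking.
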
Modifier and Type | Method and Description |
---|---|
float |
TLDScoringFilter.indexerScore(java.lang.String url,
NutchDocument doc,
WebPage page,
float initScore) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
CCIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Modifier and Type | Method and Description |
---|---|
void |
CCIndexingFilter.addUrlFeatures(NutchDocument doc,
java.lang.String urlString)
Add the features represented by a license URL.
|
NutchDocument |
CCIndexingFilter.filter(NutchDocument doc,
java.lang.String url,
WebPage page) |
Copyright © 2019 The Apache Software Foundation