Package | Description |
---|---|
org.apache.nutch.analysis.lang | Text document language identifier. |
org.apache.nutch.api.impl.db | |
org.apache.nutch.crawl | Crawl control code and tools to run the crawler. |
org.apache.nutch.fetcher | The Nutch robot. |
org.apache.nutch.host | Host database to store metadata per host. |
org.apache.nutch.indexer | Index content; configure and run indexing and cleaning jobs to add, update, and delete documents from an index. |
org.apache.nutch.indexer.anchor | An indexing plugin for inbound anchor text. |
org.apache.nutch.indexer.basic | A basic indexing plugin that adds basic fields: url, host, title, content, etc. |
org.apache.nutch.indexer.html | Index raw HTML content. |
org.apache.nutch.indexer.jsoup.extractor | Indexing filter for the jsoup-extractor plugin. |
org.apache.nutch.indexer.metadata | Indexing filter to add document metadata to the index. |
org.apache.nutch.indexer.more | A "more" indexing plugin that adds further index fields: last-modified date, MIME type, and content length. |
org.apache.nutch.indexer.subcollection | Indexing filter to assign documents to subcollections. |
org.apache.nutch.indexer.tld | Top-level domain indexing plugin. |
org.apache.nutch.microformats.reltag | A microformats Rel-Tag parser/indexer/querier plugin. |
org.apache.nutch.net | Web-related interfaces: URL filters and normalizers. |
org.apache.nutch.parse | The Parse interface and related classes. |
org.apache.nutch.parse.html | An HTML document parsing plugin. |
org.apache.nutch.parse.js | Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets. |
org.apache.nutch.parse.jsoup.extractor | Parse filter based on Jsoup. |
org.apache.nutch.parse.metatags | Parse filter to extract meta tags: keywords, description, etc. |
org.apache.nutch.parse.tika | Parse various document formats with the help of Apache Tika. |
org.apache.nutch.protocol | Classes related to the Protocol interface; see also org.apache.nutch.net.protocols. |
org.apache.nutch.protocol.file | Protocol plugin which supports retrieving local file resources. |
org.apache.nutch.protocol.ftp | Protocol plugin which supports retrieving documents via the FTP protocol. |
org.apache.nutch.protocol.http | Protocol plugin which supports retrieving documents via the HTTP protocol. |
org.apache.nutch.protocol.http.api | Common API used by the HTTP plugins (http, httpclient). |
org.apache.nutch.protocol.httpclient | Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest, and NTLM authentication schemes for web servers as well as proxy servers. |
org.apache.nutch.protocol.sftp | Protocol plugin which supports retrieving documents via the SFTP protocol. |
org.apache.nutch.scoring | The ScoringFilter interface. |
org.apache.nutch.scoring.link | Scoring filter based on link analysis. |
org.apache.nutch.scoring.opic | Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm. |
org.apache.nutch.scoring.tld | Top-level domain scoring plugin. |
org.apache.nutch.storage | Representation of data (web pages, host metadata) in abstracted storage. |
org.apache.nutch.util | Miscellaneous utility classes. |
org.apache.nutch.util.domain | Classes for domain name analysis. |
org.creativecommons.nutch | Sample plugins that parse and index Creative Commons metadata. |
Modifier and Type | Method and Description |
---|---|
NutchDocument | LanguageIndexingFilter.filter(NutchDocument doc, java.lang.String url, WebPage page) |
Parse | HTMLLanguageParser.filter(java.lang.String url, WebPage page, Parse parse, HTMLMetaTags metaTags, org.w3c.dom.DocumentFragment doc) Scans the HTML document for possible indications of the content language: the html lang attribute (http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1), meta dc.language (http://dublincore.org/documents/2000/07/16/usageguide/qualified-html.shtml#language), and meta http-equiv content-language (http://www.w3.org/TR/REC-html40/struct/global.html#h-7.4.4.2). |
Modifier and Type | Method and Description |
---|---|
static java.util.Map<java.lang.String,java.lang.Object> | DbPageConverter.convertPage(WebPage page, java.util.Set<java.lang.String> fields) |
Modifier and Type | Field and Description |
---|---|
org.apache.gora.store.DataStore<java.lang.String,WebPage> | DbUpdateReducer.datastore |
Modifier and Type | Method and Description |
---|---|
WebPage | URLWebPage.getDatum() |
Modifier and Type | Method and Description |
---|---|
byte[] | MD5Signature.calculate(WebPage page) |
byte[] | TextMD5Signature.calculate(WebPage page) |
abstract byte[] | Signature.calculate(WebPage page) |
byte[] | TextProfileSignature.calculate(WebPage page) |
long | AbstractFetchSchedule.calculateLastFetchTime(WebPage page) Returns the last fetch time of the CrawlDatum. |
long | FetchSchedule.calculateLastFetchTime(WebPage page) Calculates the last fetch time of the given CrawlDatum. |
void | AbstractFetchSchedule.forceRefetch(java.lang.String url, WebPage page, boolean asap) Resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and the page signature, so that it forces refetching. |
void | FetchSchedule.forceRefetch(java.lang.String url, WebPage row, boolean asap) Resets fetchTime, fetchInterval, modifiedTime and the page signature, so that it forces refetching. |
int | URLPartitioner.SelectorEntryPartitioner.getPartition(GeneratorJob.SelectorEntry selectorEntry, WebPage page, int numReduces) |
void | AbstractFetchSchedule.initializeSchedule(java.lang.String url, WebPage page) Initializes fetch-schedule-related data. |
void | FetchSchedule.initializeSchedule(java.lang.String url, WebPage page) Initializes fetch-schedule-related data. |
void | GeneratorMapper.map(java.lang.String reversedUrl, WebPage page, Mapper.Context context) |
protected void | WebTableReader.WebTableStatMapper.map(java.lang.String key, WebPage value, Mapper.Context context) |
protected void | WebTableReader.WebTableRegexMapper.map(java.lang.String key, WebPage value, Mapper.Context context) |
void | DbUpdateMapper.map(java.lang.String key, WebPage page, Mapper.Context context) |
void | URLWebPage.setDatum(WebPage datum) |
void | AbstractFetchSchedule.setFetchSchedule(java.lang.String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) Sets the fetchInterval and fetchTime on a successfully fetched page. |
void | FetchSchedule.setFetchSchedule(java.lang.String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) Sets the fetchInterval and fetchTime on a successfully fetched page. |
void | DefaultFetchSchedule.setFetchSchedule(java.lang.String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) |
void | AdaptiveFetchSchedule.setFetchSchedule(java.lang.String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) |
void | AbstractFetchSchedule.setPageGoneSchedule(java.lang.String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime) Specifies how to schedule refetching of pages marked as GONE. |
void | FetchSchedule.setPageGoneSchedule(java.lang.String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime) Specifies how to schedule refetching of pages marked as GONE. |
void | AbstractFetchSchedule.setPageRetrySchedule(java.lang.String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime) Adjusts the fetch schedule if fetching needs to be retried due to transient errors. |
void | FetchSchedule.setPageRetrySchedule(java.lang.String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime) Adjusts the fetch schedule if fetching needs to be retried due to transient errors. |
boolean | AbstractFetchSchedule.shouldFetch(java.lang.String url, WebPage page, long curTime) Indicates whether the page is suitable for selection in the current fetchlist. |
boolean | FetchSchedule.shouldFetch(java.lang.String url, WebPage page, long curTime) Indicates whether the page is suitable for selection in the current fetchlist. |
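As a rough illustration of how the FetchSchedule methods listed above fit together, the sketch below checks whether a page is due for refetching. It is a hedged usage sketch, not the canonical Nutch code path: the FetchScheduleFactory lookup and the NutchConfiguration helper are assumptions based on similarly named Nutch classes.

```java
// Hypothetical sketch: decide whether a page is due for refetching.
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.crawl.FetchSchedule;
import org.apache.nutch.crawl.FetchScheduleFactory;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.NutchConfiguration;

public class FetchScheduleSketch {
  public static boolean isDue(String url, WebPage page) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // Assumed factory: returns the configured schedule (Default/Adaptive).
    FetchSchedule schedule = FetchScheduleFactory.getFetchSchedule(conf);
    // shouldFetch(url, page, curTime) is listed in the table above.
    return schedule.shouldFetch(url, page, System.currentTimeMillis());
  }
}
```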
Modifier and Type | Method and Description |
---|---|
protected void | GeneratorReducer.reduce(GeneratorJob.SelectorEntry key, java.lang.Iterable<WebPage> values, Reducer.Context context) |
Constructor and Description |
---|
URLWebPage(java.lang.String url, WebPage datum) |
Modifier and Type | Method and Description |
---|---|
WebPage | FetchEntry.getWebPage() |
Modifier and Type | Method and Description |
---|---|
protected void | FetcherJob.FetcherMapper.map(java.lang.String key, WebPage page, Mapper.Context context) |
Constructor and Description |
---|
FetchEntry(Configuration conf, java.lang.String key, WebPage page) |
Modifier and Type | Method and Description |
---|---|
protected void | HostDbUpdateJob.Mapper.map(java.lang.String key, WebPage value, Mapper.Context context) |
Modifier and Type | Method and Description |
---|---|
protected void | HostDbUpdateReducer.reduce(Text key, java.lang.Iterable<WebPage> values, Reducer.Context context) |
Modifier and Type | Field and Description |
---|---|
org.apache.gora.store.DataStore<java.lang.String,WebPage> | IndexingJob.IndexerMapper.store |
Modifier and Type | Method and Description |
---|---|
NutchDocument | IndexingFilter.filter(NutchDocument doc, java.lang.String url, WebPage page) Adds fields or otherwise modifies the document that will be indexed for a parse. |
NutchDocument | IndexingFilters.filter(NutchDocument doc, java.lang.String url, WebPage page) Runs all defined filters. |
NutchDocument | IndexUtil.index(java.lang.String key, WebPage page) Indexes a WebPage, adding the following fields: id (the default uniqueKey for the NutchDocument) and digest (used to identify pages, like a unique ID, and to remove duplicates during the dedup procedure). |
void | IndexingJob.IndexerMapper.map(java.lang.String key, WebPage page, Mapper.Context context) |
void | CleaningJob.CleanMapper.map(java.lang.String key, WebPage page, Mapper.Context context) |
boolean | IndexCleaningFilters.remove(java.lang.String url, WebPage page) Runs all defined filters. |
boolean | IndexCleaningFilter.remove(java.lang.String url, WebPage page) |
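A minimal sketch of how the IndexingFilters.filter entry above is typically driven, for example from an indexing mapper. The Configuration-based IndexingFilters constructor and the no-argument NutchDocument constructor are assumptions, not confirmed by this listing.

```java
// Hypothetical sketch: run the configured indexing filter chain on one page.
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexingFilters;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.NutchConfiguration;

public class IndexingSketch {
  public static NutchDocument buildDocument(String url, WebPage page) throws Exception {
    Configuration conf = NutchConfiguration.create();
    IndexingFilters filters = new IndexingFilters(conf); // assumed constructor
    NutchDocument doc = new NutchDocument();             // assumed constructor
    // A filter may return null to exclude the page from the index entirely.
    return filters.filter(doc, url, page);
  }
}
```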
Modifier and Type | Method and Description |
---|---|
void | CleaningJob.CleanReducer.reduce(java.lang.String key, java.lang.Iterable<WebPage> values, Reducer.Context context) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | AnchorIndexingFilter.filter(NutchDocument doc, java.lang.String url, WebPage page) The AnchorIndexingFilter filter object, which supports boolean configuration settings for the deduplication of anchors. |
Modifier and Type | Method and Description |
---|---|
NutchDocument | BasicIndexingFilter.filter(NutchDocument doc, java.lang.String url, WebPage page) The BasicIndexingFilter filter object, which supports a configurable limit on the number of characters permitted within the title; see indexer.max.title.length in nutch-default.xml. |
Modifier and Type | Method and Description |
---|---|
NutchDocument | HtmlIndexingFilter.filter(NutchDocument doc, java.lang.String url, WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | JsoupIndexingFilter.filter(NutchDocument doc, java.lang.String url, WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | MetadataIndexer.filter(NutchDocument doc, java.lang.String url, WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | MoreIndexingFilter.filter(NutchDocument doc, java.lang.String url, WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | SubcollectionIndexingFilter.filter(NutchDocument doc, java.lang.String url, WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | TLDIndexingFilter.filter(NutchDocument doc, java.lang.String url, WebPage page) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | RelTagIndexingFilter.filter(NutchDocument doc, java.lang.String url, WebPage page) The RelTagIndexingFilter filter object. |
Parse | RelTagParser.filter(java.lang.String url, WebPage page, Parse parse, HTMLMetaTags metaTags, org.w3c.dom.DocumentFragment doc) |
Modifier and Type | Method and Description |
---|---|
static boolean | URLFilters.isSitemap(WebPage page) Returns true if the page is a sitemap. |
Modifier and Type | Method and Description |
---|---|
Parse | ParseFilter.filter(java.lang.String url, WebPage page, Parse parse, HTMLMetaTags metaTags, org.w3c.dom.DocumentFragment doc) Adds metadata or otherwise modifies a parse, given the DOM tree of a page. |
Parse | ParseFilters.filter(java.lang.String url, WebPage page, Parse parse, HTMLMetaTags metaTags, org.w3c.dom.DocumentFragment doc) Runs all defined filters. |
Parse | Parser.getParse(java.lang.String url, WebPage page) Parses the content held in a WebPage instance. |
NutchSitemapParse | NutchSitemapParser.getParse(java.lang.String url, WebPage page) |
static boolean | ParserJob.isTruncated(java.lang.String url, WebPage page) Checks whether the page's content is truncated. |
void | ParserJob.ParserMapper.map(java.lang.String key, WebPage page, Mapper.Context context) |
Parse | ParseUtil.parse(java.lang.String url, WebPage page) |
void | ParseUtil.process(java.lang.String url, WebPage page) Parses the given web page and stores the parsed content within the page. |
void | ParseUtil.processSitemapParse(java.lang.String url, WebPage page, Mapper.Context context) Parses the given sitemap page and stores the parsed content within the page. |
boolean | ParseUtil.status(java.lang.String url, WebPage page) |
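To make the ParseUtil entries above concrete, here is a hedged sketch of parsing a fetched page in place; the Configuration-based ParseUtil constructor is an assumption based on similar Nutch classes.

```java
// Hypothetical sketch: parse fetched content and store the result back on the page.
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.NutchConfiguration;

public class ParseSketch {
  public static void parseInPlace(String url, WebPage page) throws Exception {
    Configuration conf = NutchConfiguration.create();
    ParseUtil parseUtil = new ParseUtil(conf); // assumed constructor
    // process(url, page) is listed above: it runs the matching Parser and the
    // ParseFilters, then writes the parse result into the WebPage row.
    parseUtil.process(url, page);
  }
}
```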
Modifier and Type | Method and Description |
---|---|
Parse | HtmlParser.getParse(java.lang.String url, WebPage page) |
Modifier and Type | Method and Description |
---|---|
Parse | JSParseFilter.filter(java.lang.String url, WebPage page, Parse parse, HTMLMetaTags metaTags, org.w3c.dom.DocumentFragment doc) Scans the JavaScript looking for possible Outlinks. |
Parse | JSParseFilter.getParse(java.lang.String url, WebPage page) Parses a JavaScript file and extracts outlinks. |
Modifier and Type | Method and Description |
---|---|
Parse | JsoupHtmlParser.filter(java.lang.String url, WebPage page, Parse parse, HTMLMetaTags metaTags, org.w3c.dom.DocumentFragment doc) |
Modifier and Type | Method and Description |
---|---|
Parse | MetaTagsParser.filter(java.lang.String url, WebPage page, Parse parse, HTMLMetaTags metaTags, org.w3c.dom.DocumentFragment doc) |
Modifier and Type | Method and Description |
---|---|
Parse | TikaParser.getParse(java.lang.String url, WebPage page) |
Modifier and Type | Method and Description |
---|---|
ProtocolOutput | Protocol.getProtocolOutput(java.lang.String url, WebPage page) |
crawlercommons.robots.BaseRobotRules | Protocol.getRobotRules(java.lang.String url, WebPage page) Retrieves the robot rules applicable for this URL. |
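As a rough usage sketch for the Protocol methods above, the snippet below fetches one URL through whichever protocol plugin handles it; the ProtocolFactory class and its getProtocol(url) lookup are assumptions not shown in this listing.

```java
// Hypothetical sketch: fetch one URL via the matching protocol plugin.
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.ProtocolFactory;
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.NutchConfiguration;

public class ProtocolSketch {
  public static ProtocolOutput fetch(String url, WebPage page) throws Exception {
    Configuration conf = NutchConfiguration.create();
    ProtocolFactory factory = new ProtocolFactory(conf); // assumed factory
    Protocol protocol = factory.getProtocol(url);        // assumed lookup by URL scheme
    // Returns the fetched content plus a status describing the outcome.
    return protocol.getProtocolOutput(url, page);
  }
}
```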
Modifier and Type | Method and Description |
---|---|
ProtocolOutput | File.getProtocolOutput(java.lang.String url, WebPage page) Creates a FileResponse object corresponding to the URL and returns a ProtocolOutput object as per the content received. |
crawlercommons.robots.BaseRobotRules | File.getRobotRules(java.lang.String url, WebPage page) No robots parsing is done for the file protocol. |
Constructor and Description |
---|
FileResponse(java.net.URL url, WebPage page, File file, Configuration conf) |
Modifier and Type | Method and Description |
---|---|
ProtocolOutput | Ftp.getProtocolOutput(java.lang.String url, WebPage page) Creates a FtpResponse object corresponding to the URL and returns a ProtocolOutput object as per the content received. |
crawlercommons.robots.BaseRobotRules | Ftp.getRobotRules(java.lang.String url, WebPage page) Gets the robots rules for a given URL. |
Constructor and Description |
---|
FtpResponse(java.net.URL url, WebPage page, Ftp ftp, Configuration conf) |
Modifier and Type | Method and Description |
---|---|
protected Response | Http.getResponse(java.net.URL url, WebPage page, boolean redirect) |
Constructor and Description |
---|
HttpResponse(HttpBase http, java.net.URL url, WebPage page) |
Modifier and Type | Method and Description |
---|---|
ProtocolOutput | HttpBase.getProtocolOutput(java.lang.String url, WebPage page) |
protected abstract Response | HttpBase.getResponse(java.net.URL url, WebPage page, boolean followRedirects) |
crawlercommons.robots.BaseRobotRules | HttpBase.getRobotRules(java.lang.String url, WebPage page) |
Modifier and Type | Method and Description |
---|---|
protected Response | Http.getResponse(java.net.URL url, WebPage page, boolean redirect) Fetches the URL with a configured HTTP client and gets the response. |
Modifier and Type | Method and Description |
---|---|
ProtocolOutput | Sftp.getProtocolOutput(java.lang.String url, WebPage page) |
crawlercommons.robots.BaseRobotRules | Sftp.getRobotRules(java.lang.String url, WebPage page) |
Modifier and Type | Method and Description |
---|---|
void | ScoringFilter.distributeScoreToOutlinks(java.lang.String fromUrl, WebPage page, java.util.Collection<ScoreDatum> scoreData, int allCount) Distributes score value from the current page to all its outlinked pages. |
void | ScoringFilters.distributeScoreToOutlinks(java.lang.String fromUrl, WebPage row, java.util.Collection<ScoreDatum> scoreData, int allCount) |
float | ScoringFilter.generatorSortValue(java.lang.String url, WebPage page, float initSort) Prepares a sort value for the purpose of sorting and selecting the top N scoring pages during fetchlist generation. |
float | ScoringFilters.generatorSortValue(java.lang.String url, WebPage row, float initSort) Calculates a sort value for Generate. |
float | ScoringFilter.indexerScore(java.lang.String url, NutchDocument doc, WebPage page, float initScore) Calculates a Lucene document boost. |
float | ScoringFilters.indexerScore(java.lang.String url, NutchDocument doc, WebPage row, float initScore) |
void | ScoringFilter.initialScore(java.lang.String url, WebPage page) Sets an initial score for newly discovered pages. |
void | ScoringFilters.initialScore(java.lang.String url, WebPage row) Calculates a new initial score, used when adding newly discovered pages. |
void | ScoringFilter.injectedScore(java.lang.String url, WebPage page) Sets an initial score for newly injected pages. |
void | ScoringFilters.injectedScore(java.lang.String url, WebPage row) Calculates a new initial score, used when injecting new pages. |
void | ScoringFilter.updateScore(java.lang.String url, WebPage page, java.util.List<ScoreDatum> inlinkedScoreData) Calculates a new score during a table update, based on the values contributed by inlinked pages. |
void | ScoringFilters.updateScore(java.lang.String url, WebPage row, java.util.List<ScoreDatum> inlinkedScoreData) |
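A hedged sketch of how the aggregate ScoringFilters class listed above might be invoked during fetchlist generation; the Configuration-based constructor is an assumption.

```java
// Hypothetical sketch: compute the generator sort value through all scoring filters.
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.scoring.ScoringFilters;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.NutchConfiguration;

public class ScoringSketch {
  public static float sortValue(String url, WebPage row) throws Exception {
    Configuration conf = NutchConfiguration.create();
    ScoringFilters filters = new ScoringFilters(conf); // assumed constructor
    // Each configured ScoringFilter (e.g. OPIC) adjusts the initial sort value in turn.
    return filters.generatorSortValue(url, row, 1.0f);
  }
}
```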
Modifier and Type | Method and Description |
---|---|
void | LinkAnalysisScoringFilter.distributeScoreToOutlinks(java.lang.String fromUrl, WebPage page, java.util.Collection<ScoreDatum> scoreData, int allCount) |
float | LinkAnalysisScoringFilter.generatorSortValue(java.lang.String url, WebPage page, float initSort) |
float | LinkAnalysisScoringFilter.indexerScore(java.lang.String url, NutchDocument doc, WebPage page, float initScore) |
void | LinkAnalysisScoringFilter.initialScore(java.lang.String url, WebPage page) |
void | LinkAnalysisScoringFilter.injectedScore(java.lang.String url, WebPage page) |
void | LinkAnalysisScoringFilter.updateScore(java.lang.String url, WebPage page, java.util.List<ScoreDatum> inlinkedScoreData) |
Modifier and Type | Method and Description |
---|---|
void | OPICScoringFilter.distributeScoreToOutlinks(java.lang.String fromUrl, WebPage row, java.util.Collection<ScoreDatum> scoreData, int allCount) Gets the cash on hand, divides it by the number of outlinks, and applies it. |
float | OPICScoringFilter.generatorSortValue(java.lang.String url, WebPage row, float initSort) Uses getScore(). |
float | OPICScoringFilter.indexerScore(java.lang.String url, NutchDocument doc, WebPage row, float initScore) Dampens the boost value by scorePower. |
void | OPICScoringFilter.initialScore(java.lang.String url, WebPage row) Sets the score to 0.0f (unknown value); inlink contributions will bring it to a correct level. |
void | OPICScoringFilter.injectedScore(java.lang.String url, WebPage row) |
void | OPICScoringFilter.updateScore(java.lang.String url, WebPage row, java.util.List<ScoreDatum> inlinkedScoreData) Increases the score by the sum of inlinked scores. |
Modifier and Type | Method and Description |
---|---|
void | TLDScoringFilter.distributeScoreToOutlinks(java.lang.String fromUrl, WebPage page, java.util.Collection<ScoreDatum> scoreData, int allCount) |
float | TLDScoringFilter.generatorSortValue(java.lang.String url, WebPage page, float initSort) |
float | TLDScoringFilter.indexerScore(java.lang.String url, NutchDocument doc, WebPage page, float initScore) |
void | TLDScoringFilter.initialScore(java.lang.String url, WebPage page) |
void | TLDScoringFilter.injectedScore(java.lang.String url, WebPage page) |
void | TLDScoringFilter.updateScore(java.lang.String url, WebPage page, java.util.List<ScoreDatum> inlinkedScoreData) |
Modifier and Type | Class and Description |
---|---|
static class | WebPage.Tombstone |
Modifier and Type | Method and Description |
---|---|
WebPage | WebPage.Builder.build() |
WebPage | WebPage.newInstance() |
Modifier and Type | Method and Description |
---|---|
org.apache.avro.util.Utf8 | Mark.checkMark(WebPage page) |
static WebPage.Builder | WebPage.newBuilder(WebPage other) Creates a new WebPage RecordBuilder by copying an existing WebPage instance. |
void | Mark.putMark(WebPage page, java.lang.String markValue) |
void | Mark.putMark(WebPage page, org.apache.avro.util.Utf8 markValue) |
org.apache.avro.util.Utf8 | Mark.removeMark(WebPage page) |
org.apache.avro.util.Utf8 | Mark.removeMarkIfExist(WebPage page) Removes the mark only if the mark is present on the page. |
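The Mark methods above tag a WebPage with per-phase batch markers. Below is a hedged sketch of putting and checking a mark; the GENERATE_MARK constant is an assumption about the members of the Mark enum.

```java
// Hypothetical sketch: put a batch mark on a page, then check for it.
import org.apache.avro.util.Utf8;
import org.apache.nutch.storage.Mark;
import org.apache.nutch.storage.WebPage;

public class MarkSketch {
  public static boolean markAndCheck(String batchId) {
    WebPage page = WebPage.newInstance();           // listed above
    Mark.GENERATE_MARK.putMark(page, batchId);      // GENERATE_MARK is an assumed constant
    Utf8 mark = Mark.GENERATE_MARK.checkMark(page); // null when the mark is absent
    return mark != null;
  }
}
```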
Modifier and Type | Method and Description |
---|---|
static <K,V> void | StorageUtils.initMapperJob(Job job, java.util.Collection<WebPage.Field> fields, java.lang.Class<K> outKeyClass, java.lang.Class<V> outValueClass, java.lang.Class<? extends org.apache.gora.mapreduce.GoraMapper<java.lang.String,WebPage,K,V>> mapperClass) |
static <K,V> void | StorageUtils.initMapperJob(Job job, java.util.Collection<WebPage.Field> fields, java.lang.Class<K> outKeyClass, java.lang.Class<V> outValueClass, java.lang.Class<? extends org.apache.gora.mapreduce.GoraMapper<java.lang.String,WebPage,K,V>> mapperClass, java.lang.Class<? extends Partitioner<K,V>> partitionerClass) |
static <K,V> void | StorageUtils.initMapperJob(Job job, java.util.Collection<WebPage.Field> fields, java.lang.Class<K> outKeyClass, java.lang.Class<V> outValueClass, java.lang.Class<? extends org.apache.gora.mapreduce.GoraMapper<java.lang.String,WebPage,K,V>> mapperClass, java.lang.Class<? extends Partitioner<K,V>> partitionerClass, boolean reuseObjects) |
static <K,V> void | StorageUtils.initMapperJob(Job job, java.util.Collection<WebPage.Field> fields, java.lang.Class<K> outKeyClass, java.lang.Class<V> outValueClass, java.lang.Class<? extends org.apache.gora.mapreduce.GoraMapper<java.lang.String,WebPage,K,V>> mapperClass, java.lang.Class<? extends Partitioner<K,V>> partitionerClass, org.apache.gora.filter.Filter<java.lang.String,WebPage> filter, boolean reuseObjects) |
static <K,V> void | StorageUtils.initMapperJob(Job job, java.util.Collection<WebPage.Field> fields, java.lang.Class<K> outKeyClass, java.lang.Class<V> outValueClass, java.lang.Class<? extends org.apache.gora.mapreduce.GoraMapper<java.lang.String,WebPage,K,V>> mapperClass, org.apache.gora.filter.Filter<java.lang.String,WebPage> filter) |
static <K,V> void | StorageUtils.initReducerJob(Job job, java.lang.Class<? extends org.apache.gora.mapreduce.GoraReducer<K,V,java.lang.String,WebPage>> reducerClass) |
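As a hedged example of the simplest initMapperJob overload listed above, the following helper wires a Gora-backed mapper over a set of WebPage fields; the choice of String/WebPage as output key and value types and the throws clause are assumptions for illustration.

```java
// Hypothetical sketch: configure a Hadoop job that maps over stored WebPage rows.
import java.util.Collection;
import org.apache.gora.mapreduce.GoraMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.nutch.storage.StorageUtils;
import org.apache.nutch.storage.WebPage;

public class JobSetupSketch {
  public static void setUpMapper(
      Job job,
      Collection<WebPage.Field> fields,
      Class<? extends GoraMapper<String, WebPage, String, WebPage>> mapperClass) throws Exception {
    // Reads WebPage rows (restricted to the given fields) and lets the mapper
    // emit <String, WebPage> pairs, matching the first overload listed above.
    StorageUtils.initMapperJob(job, fields, String.class, WebPage.class, mapperClass);
  }
}
```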
Modifier and Type | Method and Description |
---|---|
WebPage | WebPageWritable.getWebPage() |
Modifier and Type | Method and Description |
---|---|
void | EncodingDetector.autoDetectClues(WebPage page, boolean filter) |
java.lang.String | EncodingDetector.guessEncoding(WebPage page, java.lang.String defaultValue) Guesses the encoding using the previously specified list of clues. |
void | WebPageWritable.setWebPage(WebPage webPage) |
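A hedged sketch of using the EncodingDetector methods listed above to guess a page's character encoding; the Configuration-based constructor is an assumption.

```java
// Hypothetical sketch: guess the character encoding of fetched content.
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.EncodingDetector;
import org.apache.nutch.util.NutchConfiguration;

public class EncodingSketch {
  public static String detectEncoding(WebPage page) throws Exception {
    Configuration conf = NutchConfiguration.create();
    EncodingDetector detector = new EncodingDetector(conf); // assumed constructor
    detector.autoDetectClues(page, true);         // collect clues from the page
    return detector.guessEncoding(page, "utf-8"); // fall back to the given default
  }
}
```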
Modifier and Type | Method and Description |
---|---|
protected void | IdentityPageReducer.reduce(java.lang.String key, java.lang.Iterable<WebPage> values, Reducer.Context context) |
Constructor and Description |
---|
WebPageWritable(Configuration conf, WebPage webPage) |
Modifier and Type | Method and Description |
---|---|
protected void | DomainStatistics.DomainStatisticsMapper.map(java.lang.String key, WebPage value, Mapper.Context context) |
Modifier and Type | Method and Description |
---|---|
NutchDocument | CCIndexingFilter.filter(NutchDocument doc, java.lang.String url, WebPage page) |
Parse | CCParseFilter.filter(java.lang.String url, WebPage page, Parse parse, HTMLMetaTags metaTags, org.w3c.dom.DocumentFragment doc) Adds metadata or otherwise modifies a parse of an HTML document, given the DOM tree of a page. |
static void | CCParseFilter.Walker.walk(org.w3c.dom.Node doc, java.net.URL base, WebPage page, Configuration conf) Scans the document, adding attributes to metadata. |
Copyright © 2019 The Apache Software Foundation