Package | Description |
---|---|
org.apache.nutch.analysis.lang | Text document language identifier. |
org.apache.nutch.microformats.reltag | A microformats Rel-Tag Parser/Indexer/Querier plugin. |
org.apache.nutch.parse | The Parse interface and related classes. |
org.apache.nutch.parse.html | An HTML document parsing plugin. |
org.apache.nutch.parse.js | Parser and parse filter plugin to extract all (possible) links from JavaScript files and embedded JavaScript code snippets. |
org.apache.nutch.parse.jsoup.extractor | Parse filter based on Jsoup. |
org.apache.nutch.parse.metatags | Parse filter to extract meta tags: keywords, description, etc. |
org.apache.nutch.parse.tika | Parse various document formats with the help of Apache Tika. |
org.creativecommons.nutch | Sample plugins that parse and index Creative Commons metadata. |
Modifier and Type | Method and Description |
---|---|
Parse | HTMLLanguageParser.filter(java.lang.String url, WebPage page, Parse parse, HTMLMetaTags metaTags, org.w3c.dom.DocumentFragment doc): Scan the HTML document for possible indications of the content language: the html lang attribute (http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1), meta dc.language (http://dublincore.org/documents/2000/07/16/usageguide/qualified-html.shtml#language), and meta http-equiv content-language (http://www.w3.org/TR/REC-html40/struct/global.html#h-7.4.4.2). |
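
The three language hints listed above can be illustrated with a small, JDK-only sketch. It is not the Nutch implementation: the class `LangHintSketch`, its methods, and the order in which the hints are checked are assumptions made for illustration, and it works on a whole `org.w3c.dom.Document` parsed from well-formed XHTML rather than on the `DocumentFragment` the filter receives.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
final class LangHintSketch {

    /** Returns the first language hint found in the document, or null if there is none. */
    static String findLanguage(Document doc) {
        // 1. lang attribute on the root <html> element (assumed to take priority)
        Element root = doc.getDocumentElement();
        if (root != null && root.hasAttribute("lang")) {
            return root.getAttribute("lang");
        }
        // 2. <meta name="dc.language">, 3. <meta http-equiv="content-language">
        NodeList metas = doc.getElementsByTagName("meta");
        String httpEquivHint = null;
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            String content = meta.getAttribute("content");
            if ("dc.language".equalsIgnoreCase(meta.getAttribute("name"))) {
                return content;
            }
            if (httpEquivHint == null
                    && "content-language".equalsIgnoreCase(meta.getAttribute("http-equiv"))) {
                httpEquivHint = content; // remember it, but keep looking for dc.language
            }
        }
        return httpEquivHint;
    }

    /** Parses well-formed XHTML into a DOM document; wraps checked exceptions. */
    static Document parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseXhtml(
                "<html><head><meta http-equiv=\"Content-Language\" content=\"de\"/></head></html>");
        System.out.println(findLanguage(doc));
    }
}
```

A real filter would typically record the detected value in the parse or page metadata instead of returning it; that step is omitted here because it depends on Nutch classes.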
Modifier and Type | Method and Description |
---|---|
Parse | RelTagParser.filter(java.lang.String url, WebPage page, Parse parse, HTMLMetaTags metaTags, org.w3c.dom.DocumentFragment doc) |
Modifier and Type | Method and Description |
---|---|
Parse | ParseFilter.filter(java.lang.String url, WebPage page, Parse parse, HTMLMetaTags metaTags, org.w3c.dom.DocumentFragment doc): Adds metadata or otherwise modifies a parse, given the DOM tree of a page. |
Parse | ParseFilters.filter(java.lang.String url, WebPage page, Parse parse, HTMLMetaTags metaTags, org.w3c.dom.DocumentFragment doc): Runs all defined filters. |
static Parse | ParseStatusUtils.getEmptyParse(java.lang.Exception e, Configuration conf) |
static Parse | ParseStatusUtils.getEmptyParse(int minorCode, java.lang.String message, Configuration conf) |
Parse | Parser.getParse(java.lang.String url, WebPage page): Parses the content of a WebPage instance. |
Parse | ParseUtil.parse(java.lang.String url, WebPage page) |
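
The idea of running all defined filters over one parse can be sketched as a simple chain in which each filter receives the result of the previous one. Everything in this sketch is hypothetical: the `Filter` interface is a simplified stand-in that uses a `Map` in place of `Parse`, and stopping the chain when a filter returns `null` is an assumption, not documented Nutch behavior.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a filter chain; not the Nutch ParseFilters class.
final class FilterChainSketch {

    /** Simplified stand-in for a parse filter: a Map plays the role of the parse. */
    interface Filter {
        Map<String, String> filter(String url, Map<String, String> parse);
    }

    /** Applies every filter in order, feeding each one the previous result. */
    static Map<String, String> runAll(List<Filter> filters, String url,
                                      Map<String, String> parse) {
        Map<String, String> current = parse;
        for (Filter f : filters) {
            current = f.filter(url, current);
            if (current == null) {
                return null; // assumption: a filter that rejects the parse ends the chain
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<Filter> filters = new ArrayList<>();
        filters.add((url, parse) -> { parse.put("source", url); return parse; });
        filters.add((url, parse) -> { parse.put("checked", "true"); return parse; });
        System.out.println(runAll(filters, "http://example.com/", new LinkedHashMap<>()));
    }
}
```

Passing the running result from filter to filter lets each plugin see the metadata added by the ones before it, which is why the order of the filter list matters in a design like this.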
Modifier and Type | Method and Description |
---|---|
Parse | HtmlParser.getParse(java.lang.String url, WebPage page) |
Modifier and Type | Method and Description |
---|---|
Parse | JSParseFilter.filter(java.lang.String url, WebPage page, Parse parse, HTMLMetaTags metaTags, org.w3c.dom.DocumentFragment doc): Scans the JavaScript looking for possible Outlinks. |
Parse | JSParseFilter.getParse(java.lang.String url, WebPage page): Parses a JavaScript file and extracts outlinks. |
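
Scanning JavaScript for possible outlinks is usually a heuristic, which is why the description speaks of possible links. One rough way to approximate it is to pull quoted string literals that look like absolute URLs out of the script text. The class `JsLinkSketch` and its regular expression are assumptions made for illustration; the patterns the plugin actually uses may differ, and relative links are ignored here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the Nutch JSParseFilter.
final class JsLinkSketch {

    // A single- or double-quoted literal that starts with http:// or https://.
    // Group 1 is the opening quote, group 2 the candidate URL.
    private static final Pattern URL_LITERAL =
            Pattern.compile("([\"'])(https?://[^\"'\\s]+)\\1");

    /** Returns the distinct candidate links in order of first appearance. */
    static List<String> extractLinks(String javascript) {
        List<String> links = new ArrayList<>();
        Matcher m = URL_LITERAL.matcher(javascript);
        while (m.find()) {
            String link = m.group(2);
            if (!links.contains(link)) {
                links.add(link);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String js = "var a = \"http://example.com/a\"; window.location='https://example.org/b?x=1';";
        for (String link : extractLinks(js)) {
            System.out.println(link);
        }
    }
}
```

Because the match is purely textual, it can report strings that are never used as links, so results from an approach like this are best treated as candidates.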
Modifier and Type | Method and Description |
---|---|
Parse | JsoupHtmlParser.filter(java.lang.String url, WebPage page, Parse parse, HTMLMetaTags metaTags, org.w3c.dom.DocumentFragment doc) |
Modifier and Type | Method and Description |
---|---|
Parse | MetaTagsParser.filter(java.lang.String url, WebPage page, Parse parse, HTMLMetaTags metaTags, org.w3c.dom.DocumentFragment doc) |
Modifier and Type | Method and Description |
---|---|
Parse | TikaParser.getParse(java.lang.String url, WebPage page) |
Modifier and Type | Method and Description |
---|---|
Parse | CCParseFilter.filter(java.lang.String url, WebPage page, Parse parse, HTMLMetaTags metaTags, org.w3c.dom.DocumentFragment doc): Adds metadata or otherwise modifies a parse of an HTML document, given the DOM tree of a page. |
Copyright © 2019 The Apache Software Foundation