public class File extends java.lang.Object implements Protocol
Creates a FileResponse object and gets the content of the url from it.
Configurable parameters are file.content.limit and file.crawl.parent, defined in the "file properties" section of nutch-default.xml.

Modifier and Type | Field and Description |
---|---|
protected static org.slf4j.Logger | LOG |

Fields inherited from interface Protocol: CHECK_BLOCKING, CHECK_ROBOTS, X_POINT_ID

Constructor and Description |
---|
File() |
Modifier and Type | Method and Description |
---|---|
Configuration | getConf() Get the Configuration object. |
java.util.Collection<WebPage.Field> | getFields() |
ProtocolOutput | getProtocolOutput(java.lang.String url, WebPage page) Creates a FileResponse object corresponding to the url and returns a ProtocolOutput object as per the content received. |
crawlercommons.robots.BaseRobotRules | getRobotRules(java.lang.String url, WebPage page) No robots parsing is done for file protocol. |
static void | main(java.lang.String[] args) Quick way for running this class. |
void | setConf(Configuration conf) Set the Configuration object. |
void | setMaxContentLength(int maxContentLength) Set the point at which content is truncated. |

public void setConf(Configuration conf)
Set the Configuration object.
Specified by: setConf in interface Configurable

public Configuration getConf()
Get the Configuration object.
Specified by: getConf in interface Configurable

public void setMaxContentLength(int maxContentLength)
Set the point at which content is truncated.
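
The effect of a content-length limit such as the one set by setMaxContentLength can be illustrated with a small standalone sketch. This is not the plugin's implementation: the class name TruncatedReadSketch and the helper readTruncated are hypothetical, and treating a negative limit as "no truncation" is an assumption of this sketch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class TruncatedReadSketch {

    /**
     * Reads at most maxContentLength bytes from the given file.
     * Assumption of this sketch: a negative limit means "do not truncate".
     */
    static byte[] readTruncated(Path file, int maxContentLength) throws IOException {
        if (maxContentLength < 0) {
            return Files.readAllBytes(file);
        }
        // Never allocate more than the file actually holds.
        int capacity = (int) Math.min((long) maxContentLength, Files.size(file));
        byte[] buf = new byte[capacity];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (total < buf.length) {
                int n = in.read(buf, total, buf.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        }
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("truncate-sketch", ".txt");
        try {
            Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
            System.out.println(readTruncated(tmp, 4).length);   // limit below file size
            System.out.println(readTruncated(tmp, -1).length);  // no limit
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Capping the buffer at the file size keeps a very large limit from allocating memory that will never be filled.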

public ProtocolOutput getProtocolOutput(java.lang.String url, WebPage page)
Creates a FileResponse object corresponding to the url and returns a ProtocolOutput object as per the content received.
Specified by: getProtocolOutput in interface Protocol
Parameters:
url - Text containing the url
page - WebPage object relative to the URL
Returns: ProtocolOutput object for the content of the file indicated by url

public java.util.Collection<WebPage.Field> getFields()
Specified by: getFields in interface FieldPluggable
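
As a rough illustration of what getProtocolOutput describes (resolving a file: url and returning its content together with a status), the following standalone sketch uses only the JDK. It does not use FileResponse or ProtocolOutput; the class FileUrlFetchSketch, the Result holder, and the HTTP-like status codes 200 and 404 are all assumptions made for this example.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileUrlFetchSketch {

    /** Hypothetical result holder: a status code plus the bytes read (empty on failure). */
    static final class Result {
        final int code;
        final byte[] content;

        Result(int code, byte[] content) {
            this.code = code;
            this.content = content;
        }
    }

    /** Resolves a file: url to a local path and reads it if it is a regular file. */
    static Result fetch(String url) throws IOException {
        URI uri = URI.create(url);
        if (!"file".equalsIgnoreCase(uri.getScheme())) {
            throw new IllegalArgumentException("Not a file: url: " + url);
        }
        Path path = Paths.get(uri);
        if (!Files.isRegularFile(path)) {
            // Sketch-only convention: anything that is not a readable regular file is "not found".
            return new Result(404, new byte[0]);
        }
        return new Result(200, Files.readAllBytes(path));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("fetch-sketch", ".txt");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
        String url = tmp.toUri().toString();

        Result found = fetch(url);
        System.out.println(found.code + " " + new String(found.content, StandardCharsets.UTF_8));

        Files.delete(tmp);
        System.out.println(fetch(url).code);
    }
}
```

Building the url with Path.toUri() avoids hand-escaping characters such as spaces, which URI.create would otherwise reject.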

public static void main(java.lang.String[] args) throws java.lang.Exception
Quick way for running this class.
Throws: java.lang.Exception

public crawlercommons.robots.BaseRobotRules getRobotRules(java.lang.String url, WebPage page)
No robots parsing is done for file protocol.
Specified by: getRobotRules in interface Protocol
Parameters:
url - url to check

Copyright © 2019 The Apache Software Foundation