public class HttpRobotRulesParser extends RobotRulesParser

This class extends the generic RobotRulesParser class and contains the HTTP protocol specific implementation for obtaining the robots file.

Modifier and Type | Field and Description
---|---
protected boolean | allowForbidden

Fields inherited from class RobotRulesParser: agentNames, CACHE, EMPTY_RULES, FORBID_ALL_RULES
Constructor and Description
---
HttpRobotRulesParser(Configuration conf)
Modifier and Type | Method and Description
---|---
protected static java.lang.String | getCacheKey(java.net.URL url): Compose unique key to store and access robot rules in cache for given URL
crawlercommons.robots.BaseRobotRules | getRobotRulesSet(Protocol http, java.net.URL url): Get the rules from robots.txt which apply for the given url.

Methods inherited from class RobotRulesParser: getConf, getRobotRulesSet, main, parseRules, setConf
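
A minimal usage sketch, assuming a Hadoop Configuration and a Protocol instance obtained from Nutch's protocol plugin machinery (neither is shown on this page); the helper class and method names below are hypothetical:

```java
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;

import crawlercommons.robots.BaseRobotRules;

/** Hypothetical helper showing how a fetcher might consult robots rules. */
public class RobotsCheckExample {

  /**
   * Returns true if robots.txt allows fetching the given URL.
   * The Protocol instance is assumed to be the HTTP protocol plugin
   * resolved elsewhere for this URL.
   */
  public static boolean mayFetch(Configuration conf, Protocol http, URL url) {
    HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);
    // Fetches protocol://host:port/robots.txt (or reads cached rules) and
    // returns them as a crawler-commons BaseRobotRules object.
    BaseRobotRules rules = parser.getRobotRulesSet(http, url);
    return rules.isAllowed(url.toString());
  }
}
```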
public HttpRobotRulesParser(Configuration conf)
protected static java.lang.String getCacheKey(java.net.URL url)
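
As described under getRobotRulesSet below, rules are cached per unique combination of protocol, host, and port. A sketch of how such a key could be composed, illustrative only and not necessarily the exact implementation:

```java
import java.net.URL;

public class CacheKeySketch {
  /** Illustrative composition of a robots-rules cache key from a URL. */
  static String cacheKeyFor(URL url) {
    String protocol = url.getProtocol().toLowerCase();
    String host = url.getHost().toLowerCase();
    int port = url.getPort();
    if (port == -1) {
      port = url.getDefaultPort(); // fall back to the protocol's default port
    }
    return protocol + ":" + host + ":" + port;
  }
}
```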
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http, java.net.URL url)

Get the rules from robots.txt which apply for the given url. Robot rules are cached for a unique combination of host, protocol, and port. If no rules are found in the cache, an HTTP request is sent to fetch protocol://host:port/robots.txt. The robots.txt is then parsed and the rules are cached to avoid re-fetching and re-parsing it.

Specified by: getRobotRulesSet in class RobotRulesParser

Parameters:
http - The Protocol object
url - URL robots.txt applies to

Returns:
BaseRobotRules holding the rules from robots.txt
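
A condensed sketch of this cache-then-fetch flow, using plain JDK classes and the crawler-commons parser rather than the actual Nutch implementation; the class and method names below are illustrative, and error handling (for example the allowForbidden case for 403 responses) is omitted:

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

/** Illustrative cache-then-fetch flow; not the Nutch implementation. */
public class RobotsFlowSketch {
  private final Map<String, BaseRobotRules> cache = new HashMap<>();
  private final SimpleRobotRulesParser parser = new SimpleRobotRulesParser();

  public BaseRobotRules rulesFor(URL url, String agentName) throws Exception {
    // Cache key: unique combination of protocol, host, and port.
    int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
    String key = url.getProtocol() + ":" + url.getHost() + ":" + port;

    BaseRobotRules rules = cache.get(key);
    if (rules == null) {
      // Not cached yet: fetch protocol://host:port/robots.txt over HTTP.
      URL robotsUrl = new URL(url.getProtocol(), url.getHost(), port, "/robots.txt");
      HttpURLConnection conn = (HttpURLConnection) robotsUrl.openConnection();
      ByteArrayOutputStream body = new ByteArrayOutputStream();
      try (InputStream in = conn.getInputStream()) {
        byte[] buf = new byte[4096];
        for (int n = in.read(buf); n != -1; n = in.read(buf)) {
          body.write(buf, 0, n);
        }
      }
      // Parse the robots.txt content and cache the resulting rules.
      rules = parser.parseContent(robotsUrl.toString(), body.toByteArray(),
          "text/plain", agentName);
      cache.put(key, rules);
    }
    return rules;
  }
}
```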
Copyright © 2019 The Apache Software Foundation