public class FetcherJob extends NutchTool implements Tool
Modifier and Type | Class and Description |
---|---|
static class | FetcherJob.FetcherMapper - Mapper class for Fetcher. |
Modifier and Type | Field and Description |
---|---|
protected static org.slf4j.Logger | LOG |
static java.lang.String | PARSE_KEY |
static int | PERM_REFRESH_TIME |
static java.lang.String | PROTOCOL_REDIR |
static org.apache.avro.util.Utf8 | REDIRECT_DISCOVERED |
static java.lang.String | RESUME_KEY |
static java.lang.String | SITEMAP |
static java.lang.String | SITEMAP_DETECT |
static java.lang.String | THREADS_KEY |
Fields inherited from class NutchTool: currentJob, currentJobNum, numJobs, results, status
Constructor and Description |
---|
FetcherJob() |
FetcherJob(Configuration conf) |
Modifier and Type | Method and Description |
---|---|
int | fetch(java.lang.String batchId, int threads, boolean shouldResume, int numTasks) - Run fetcher. |
int | fetch(java.lang.String batchId, int threads, boolean shouldResume, int numTasks, boolean stmDetect, boolean sitemap) - Run fetcher. |
java.util.Collection<WebPage.Field> | getFields(Job job) |
static void | main(java.lang.String[] args) |
java.util.Map<java.lang.String,java.lang.Object> | run(java.util.Map<java.lang.String,java.lang.Object> args) - Runs the tool, using a map of arguments. |
int | run(java.lang.String[] args) |
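Methods inherited from class NutchTool: getProgress, getStatus, killJob, stopJob
Methods inherited from class Configured: getConf, setConf
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Configurable: getConf, setConf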
public static final java.lang.String PROTOCOL_REDIR
public static final int PERM_REFRESH_TIME
public static final org.apache.avro.util.Utf8 REDIRECT_DISCOVERED
public static final java.lang.String RESUME_KEY
public static final java.lang.String SITEMAP
public static final java.lang.String SITEMAP_DETECT
public static final java.lang.String PARSE_KEY
public static final java.lang.String THREADS_KEY
protected static final org.slf4j.Logger LOG
public FetcherJob()
public FetcherJob(Configuration conf)
public java.util.Collection<WebPage.Field> getFields(Job job)
public java.util.Map<java.lang.String,java.lang.Object> run(java.util.Map<java.lang.String,java.lang.Object> args) throws java.lang.Exception
Description copied from class NutchTool: Runs the tool, using a map of arguments.
public int fetch(java.lang.String batchId, int threads, boolean shouldResume, int numTasks) throws java.lang.Exception
Parameters:
batchId - batchId (obtained from Generator), or null to fetch all generated fetchlists
threads - number of threads per map task
shouldResume -
numTasks - number of fetching tasks (reducers). If less than 1, the default given by mapreduce.job.reduces is used.
Throws: java.lang.Exception
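As a rough illustration of the numTasks rule described above, the sketch below falls back to a configured default when the caller passes a value below 1. It is not FetcherJob's implementation: `ReducerCountSketch`, `resolveNumTasks`, and the plain `Map` standing in for a Hadoop Configuration are hypothetical names introduced for this example. Only the property name mapreduce.job.reduces comes from the parameter description, and the fallback to 1 when the property is unset is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

public class ReducerCountSketch {
    // Property named in the numTasks parameter description.
    public static final String REDUCES_KEY = "mapreduce.job.reduces";

    /**
     * Hypothetical helper: returns numTasks when it is at least 1,
     * otherwise the value stored under mapreduce.job.reduces.
     * The conf map is a stand-in for a Hadoop Configuration.
     */
    public static int resolveNumTasks(int numTasks, Map<String, String> conf) {
        if (numTasks >= 1) {
            return numTasks;
        }
        // Assumption: use 1 when the property is not set at all.
        return Integer.parseInt(conf.getOrDefault(REDUCES_KEY, "1"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(REDUCES_KEY, "4");
        // An explicit positive value is kept as given.
        System.out.println(resolveNumTasks(8, conf));
        // A value below 1 defers to the configured default.
        System.out.println(resolveNumTasks(-1, conf));
    }
}
```

Keeping the fallback in one small function makes the "less than 1 means default" convention easy to check in isolation from the job setup.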
public int fetch(java.lang.String batchId, int threads, boolean shouldResume, int numTasks, boolean stmDetect, boolean sitemap) throws java.lang.Exception
Parameters:
batchId - batchId (obtained from Generator), or null to fetch all generated fetchlists
threads - number of threads per map task
shouldResume -
numTasks - number of fetching tasks (reducers). If less than 1, the default given by mapreduce.job.reduces is used.
stmDetect - if true, sitemap detection is run
sitemap - if true, only sitemap files are fetched; if false, only normal URLs are fetched
Throws: java.lang.Exception
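The sitemap flag can be read as a filter over the fetch list: true selects only sitemap entries, false only ordinary URLs. The sketch below illustrates that reading and nothing more. `SitemapFilterSketch`, `Entry`, its `isSitemap` marker, and `select` are hypothetical stand-ins; how FetcherJob actually marks and filters sitemap pages is not described in this reference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SitemapFilterSketch {
    /** Hypothetical fetch-list entry; this is not a Nutch class. */
    public static final class Entry {
        final String url;
        final boolean isSitemap;

        public Entry(String url, boolean isSitemap) {
            this.url = url;
            this.isSitemap = isSitemap;
        }
    }

    /** Keeps only sitemap entries when sitemap is true, only normal URLs when it is false. */
    public static List<String> select(List<Entry> entries, boolean sitemap) {
        List<String> selected = new ArrayList<>();
        for (Entry e : entries) {
            if (e.isSitemap == sitemap) {
                selected.add(e.url);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
            new Entry("http://example.com/sitemap.xml", true),
            new Entry("http://example.com/index.html", false));
        // First call keeps the sitemap entry, second call keeps the normal URL.
        System.out.println(select(entries, true));
        System.out.println(select(entries, false));
    }
}
```

Under this reading the two modes are mutually exclusive, so a crawl that needs both sitemap files and normal pages would take two separate fetch runs.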
public int run(java.lang.String[] args) throws java.lang.Exception
public static void main(java.lang.String[] args) throws java.lang.Exception
Throws: java.lang.Exception
Copyright © 2019 The Apache Software Foundation