xsmeral.semnet.crawler
Class HTMLCrawler
java.lang.Object
xsmeral.pipe.AbstractObjectProcessor
xsmeral.pipe.LocalObjectSource<EntityDocument>
xsmeral.semnet.crawler.HTMLCrawler
- All Implemented Interfaces:
- Runnable, ContextAware, ObjectProcessor, ObjectSource<EntityDocument>
@ObjectProcessorInterface(out=EntityDocument.class)
public class HTMLCrawler
extends LocalObjectSource<EntityDocument>
A web crawler of HTML pages. Crawls configured hosts, looking for links matching specified patterns.
Documents at the matched URLs are wrapped in EntityDocuments and passed to a scraper (AbstractScraper).
Scrapers work in co-operation with the crawler, using the same configuration.
Persistent state is maintained using URLManager and HostManager, enabling the crawler to be stopped and restarted at any time.
Configuration
The crawler is configured with a CrawlerConfiguration, which contains HostDescriptors that describe the crawling targets.
Crawling is focused on entities represented by URLs of the host; entities are described by EntityDescriptors.
The database configuration for state persistence, internally represented by the RDBLayer class, is stored in the crawler configuration as well.
The configuration is stored in XML files; an external library (XStream) is used for XML (de)serialization.
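A minimal sketch of loading such a configuration and constructing the crawler, assuming the XML file was written by XStream; the file name crawler.xml is hypothetical, and imports of the project classes are omitted since their packages are not shown here:

    import java.io.FileReader;
    import com.thoughtworks.xstream.XStream;

    public class CrawlerSetup {
        public static HTMLCrawler fromXml(String path) throws Exception {
            // Illustrative only: the XML layout is defined by CrawlerConfiguration
            // and its HostDescriptor/EntityDescriptor fields.
            XStream xstream = new XStream();
            CrawlerConfiguration conf = (CrawlerConfiguration) xstream.fromXML(new FileReader(path));
            return new HTMLCrawler(conf);  // the constructor initializes crawler state from the configuration
        }
    }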
Bootstrapping
Certain starting points (URLs) need to be specified to seed the crawler.
These are supplied as a list of absolute URLs contained in a single file, one URL per line.
This file should either be placed in the same directory as the crawler configuration and named the value of DEF_BOOTSTRAP_FILE,
or another file should be supplied in the bootstrap initialization parameter.
All URLs in the file need to have their corresponding hosts already defined in the configuration; otherwise they are ignored.
After the URLs from the file have been successfully read and added to the DB, the file is renamed.
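A sketch of seeding an already constructed crawler programmatically, using the bootstrapURLs and bootstrapFromFile methods documented below; the URLs and the file name are hypothetical, and each URL's host must already have a HostDescriptor in the configuration or the URL is ignored:

    import java.io.File;
    import java.util.Arrays;
    import java.util.List;

    public class CrawlerSeeding {
        static void seed(HTMLCrawler crawler) {
            List<String> seeds = Arrays.asList(
                    "http://www.example.org/movie/1",
                    "http://www.example.org/person/42");
            int added = crawler.bootstrapURLs(seeds);          // number of URLs actually added to the DB
            System.out.println(added + " URLs added");
            crawler.bootstrapFromFile(new File("bootstrap"));  // same, one absolute URL per line in a file
        }
    }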
Crawling
One instance of the crawler crawls multiple hosts at the same time, with one or more threads per host.
An implementation of the Robots Exclusion standard is provided in the class RobotsPolicy,
which allows the crawler to obey the crawling rules defined by the target host ((dis)allowed URL patterns and crawl delay).
Adherence to these rules is optional.
Retrieved web pages are decoded using the character encoding determined (or guessed) by the class CharsetDetector
and parsed using the third-party library HtmlCleaner.
The library is used as a compensatory measure for the multitude of web pages that are not valid HTML.
Another compensatory measure is URL normalization, provided by URLUtil.normalize(URL), which ensures a consistent representation of URLs.
Consistent HTTP connection settings are provided by the auxiliary class ConnectionManager.
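For illustration, the kind of canonicalization that normalization performs can be shown with plain java.net classes; the exact rules applied by URLUtil.normalize(URL) are defined by that class:

    import java.net.URI;

    public class NormalizeDemo {
        public static void main(String[] args) {
            // Collapsing redundant path segments is one part of producing a consistent URL form.
            URI raw = URI.create("http://www.example.org/a/./b/../movie/42");
            System.out.println(raw.normalize());  // prints http://www.example.org/a/movie/42
        }
    }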
- Initialization parameters
conf - Crawler configuration file name
bootstrap - (Optional) Name of a file containing a list of URLs (one per line) to load into the database prior to running (see the sketch after this list)
- Context parameters supplied
hostManager - A HostManager instance initialized with hosts from the crawler configuration.
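A minimal sketch of these parameters as name/value pairs; the actual mechanism for passing them to the processor is defined by the pipe framework, and the file names below are hypothetical:

    import java.util.HashMap;
    import java.util.Map;

    public class CrawlerParams {
        // Hypothetical parameter map; the pipe framework defines how parameters are really supplied.
        static Map<String, String> params() {
            Map<String, String> params = new HashMap<>();
            params.put("conf", "crawler.xml");      // crawler configuration file name
            params.put("bootstrap", "bootstrap");   // optional list of seed URLs, one per line
            return params;
        }
    }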
Methods inherited from class xsmeral.pipe.AbstractObjectProcessor
canStart, failStart, failStart, failStart, getContext, getInType, getOutType, getParams, getStatus, initContext, initContextSet, initialize, initializeInternal, postRun, preRun, requestStop, setContext, stop, toString
DEF_BOOTSTRAP_FILE
public static final String DEF_BOOTSTRAP_FILE
- See Also:
- Constant Field Values
BOOTSTRAP_OLD_SUFFIX
public static final String BOOTSTRAP_OLD_SUFFIX
- See Also:
- Constant Field Values
CONNECTION_RETRIES
public static final int CONNECTION_RETRIES
- See Also:
- Constant Field Values
HTMLCrawler
public HTMLCrawler()
HTMLCrawler
public HTMLCrawler(CrawlerConfiguration crawlerConf)
- Initializes the state using the supplied configuration
- Parameters:
crawlerConf
- Crawler configuration
ignoresPolicy
public boolean ignoresPolicy()
- See Also:
CrawlerConfiguration.isPolicyIgnored()
setIgnoresPolicy
public void setIgnoresPolicy(boolean ignoresPolicy)
- See Also:
CrawlerConfiguration.setPolicyIgnored(boolean)
setGlobalCrawlDelayMinimum
public void setGlobalCrawlDelayMinimum(int globalCrawlDelayMinimum)
- See Also:
CrawlerConfiguration.setGlobalCrawlDelayMinimum(int)
getGlobalCrawlDelayMinimum
public int getGlobalCrawlDelayMinimum()
- See Also:
CrawlerConfiguration.getGlobalCrawlDelayMinimum()
isFakeReferrer
public boolean isFakeReferrer()
- See Also:
CrawlerConfiguration.isFakeReferrer()
setFakeReferrer
public void setFakeReferrer(boolean fakeReferrer)
- See Also:
CrawlerConfiguration.setFakeReferrer(boolean)
bootstrapURLs
public int bootstrapURLs(Collection<String> urls)
- Adds valid (matched by a defined pattern) URLs from the supplied collection to the DB.
- Parameters:
urls
- The list of URLs to add
- Returns:
- Number of valid URLs added to DB
- See Also:
EntityDescriptor
,
HostDescriptor
bootstrapFromFile
public void bootstrapFromFile(File bootFile)
- Reads a file containing a list of URLs and calls bootstrapURLs(Collection).
initWithContext
protected void initWithContext()
- Deserializes crawler configuration from XML and initializes crawler state
- Overrides:
initWithContext
in class AbstractObjectProcessor
initPostContext
protected void initPostContext()
- Overrides:
initPostContext
in class AbstractObjectProcessor
run
public void run()
- Starts the crawling threads, waits for all of them to die, then stops.
Looks for a bootstrap file which should contain a list of URLs to bootstrap the crawler.
Once the list has been read, the file is renamed so that the same URLs won't be read again.
Also, before starting, checks for URLs locked by a previous run and unlocks them.
(A usage sketch follows this method listing.)
- Specified by:
run
in interface Runnable
- Specified by:
run
in interface ObjectProcessor
- Overrides:
run
in class AbstractObjectProcessor
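A minimal sketch of driving the crawler with a plain Thread, assuming an HTMLCrawler instance whose configuration and context have already been set up (normally done by the pipe framework):

    public class CrawlerRunner {
        // run() starts the per-host crawling threads, blocks until all of them die, then stops.
        static void runAndWait(HTMLCrawler crawler) throws InterruptedException {
            Thread t = new Thread(crawler, "html-crawler");
            t.start();
            t.join();
        }
    }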