xsmeral.semnet.crawler
Class HTMLCrawler

java.lang.Object
  extended by xsmeral.pipe.AbstractObjectProcessor
      extended by xsmeral.pipe.LocalObjectSource<EntityDocument>
          extended by xsmeral.semnet.crawler.HTMLCrawler
All Implemented Interfaces:
Runnable, ContextAware, ObjectProcessor, ObjectSource<EntityDocument>

@ObjectProcessorInterface(out=EntityDocument.class)
public class HTMLCrawler
extends LocalObjectSource<EntityDocument>

A web crawler of HTML pages. Crawls configured hosts, looking for links matching specified patterns.
Documents at the matched URLs are wrapped in EntityDocuments and passed to a scraper (AbstractScraper). Scrapers work in co-operation with the crawler, using the same configuration. Persistent state is maintained using URLManager and HostManager, enabling the crawler to be stopped and restarted at any time.

Configuration

The crawler is configured with a CrawlerConfiguration, which contains HostDescriptors describing the crawling targets. Crawling focuses on entities represented by URLs of the host; entities are described by EntityDescriptors.
The database configuration for state persistence, internally represented by an RDBLayer, is stored in the crawler configuration as well. The configuration is stored as an XML file; an external library (XStream) is used for XML (de)serialization.
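As a rough illustration, a serialized configuration might look like the sketch below. All element names are hypothetical: the actual XML layout depends entirely on the XStream aliases configured for CrawlerConfiguration, HostDescriptor and EntityDescriptor.

```xml
<!-- Illustrative sketch only; element names depend on configured XStream aliases -->
<crawlerConfiguration>
  <hosts>
    <hostDescriptor>
      <baseUrl>http://example.org/</baseUrl>
      <entities>
        <entityDescriptor>
          <urlPattern>/movie/\d+</urlPattern>
        </entityDescriptor>
      </entities>
    </hostDescriptor>
  </hosts>
  <dbLayer><!-- RDBLayer settings --></dbLayer>
</crawlerConfiguration>
```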

Bootstrapping

Certain starting points (URLs) need to be specified to seed the crawler. These are supplied as a list of absolute URLs contained in a single file, one URL per line. This file should either be placed in the same directory as the crawler configuration and named with the value of DEF_BOOTSTRAP_FILE, or another file name should be supplied in the bootstrap initialization parameter.
All URLs in the file need to have their corresponding hosts already defined in the configuration, otherwise they are ignored.
After URLs from the file have been successfully read and added to the database, the file is renamed.
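The host-filtering rule above can be sketched with the standard library alone. This is not the crawler's implementation, just an illustration of the documented behavior: URLs whose host is not among the configured hosts are dropped, as are malformed lines.

```java
import java.net.URI;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch: keep only bootstrap URLs whose host is configured,
// mirroring the documented rule that URLs with unknown hosts are ignored.
public class BootstrapSketch {
    public static List<String> filterByHost(List<String> urls, Set<String> configuredHosts) {
        return urls.stream()
                .filter(u -> {
                    try {
                        String host = URI.create(u).getHost();
                        return host != null && configuredHosts.contains(host);
                    } catch (IllegalArgumentException e) {
                        return false; // malformed URL line: ignore
                    }
                })
                .collect(Collectors.toList());
    }
}
```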

Crawling

One instance of the crawler crawls multiple hosts at the same time, with one or more threads per host. An implementation of the Robots Exclusion standard is provided in the class RobotsPolicy, which allows the crawler to obey the crawling rules defined by the target host (allowed and disallowed URL patterns and crawl delay). Adherence to the rules is optional.
Retrieved web pages are decoded using the character encoding determined (or guessed) by the class CharsetDetector and parsed using the third-party library HtmlCleaner, which compensates for the multitude of web pages that are not valid HTML. Another compensatory measure is URL normalization, provided by URLUtil.normalize(URL), which ensures a consistent representation of URLs.
Consistent HTTP connection settings are provided by auxiliary class ConnectionManager.
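To give an idea of what URL normalization typically involves, the sketch below lowercases the scheme and host, drops the default port, and ensures a non-empty path. It uses only the standard library; the actual rules applied by URLUtil.normalize(URL) may differ.

```java
import java.net.URL;
import java.util.Locale;

// Illustrative sketch of URL normalization; not the actual URLUtil.normalize(URL).
public class NormalizeSketch {
    public static String normalize(URL url) {
        String protocol = url.getProtocol().toLowerCase(Locale.ROOT);
        String host = url.getHost().toLowerCase(Locale.ROOT);
        int port = url.getPort();
        // Drop the port when it equals the protocol default (e.g. 80 for http)
        String portPart = (port == -1 || port == url.getDefaultPort()) ? "" : ":" + port;
        String path = url.getPath().isEmpty() ? "/" : url.getPath();
        return protocol + "://" + host + portPart + path;
    }
}
```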


Initialization parameters
conf - Crawler configuration file name
bootstrap - (Optional) Name of a file containing a list of URLs (one per line) to load to the database prior to running

Context parameters supplied
hostManager - A HostManager instance initialized with hosts from the crawler configuration.

Nested Class Summary
 
Nested classes/interfaces inherited from interface xsmeral.pipe.interfaces.ObjectProcessor
ObjectProcessor.Status
 
Field Summary
static String BOOTSTRAP_OLD_SUFFIX
           
static int CONNECTION_RETRIES
           
static String DEF_BOOTSTRAP_FILE
           
 
Fields inherited from class xsmeral.pipe.LocalObjectSource
next, outBuffer, outBufferCapacity
 
Fields inherited from class xsmeral.pipe.AbstractObjectProcessor
canStart, context, status
 
Constructor Summary
HTMLCrawler()
           
HTMLCrawler(CrawlerConfiguration crawlerConf)
          Initializes the state using the supplied configuration
 
Method Summary
 void bootstrapFromFile(File bootFile)
          Reads a file containing a list of URLs and calls bootstrapURLs.
 int bootstrapURLs(Collection<String> urls)
          Adds valid (matched by a defined pattern) URLs from the supplied collection to the DB.
 int getGlobalCrawlDelayMinimum()
           
 boolean ignoresPolicy()
           
protected  void initPostContext()
           
protected  void initWithContext()
          Deserializes crawler configuration from XML and initializes crawler state
 boolean isFakeReferrer()
           
 void run()
          Starts the crawling threads, waits for all to die, then stops.
 void setFakeReferrer(boolean fakeReferrer)
           
 void setGlobalCrawlDelayMinimum(int globalCrawlDelayMinimum)
           
 void setIgnoresPolicy(boolean ignoresPolicy)
           
 
Methods inherited from class xsmeral.pipe.LocalObjectSource
getNext, getOutBuffer, handleStoppedSink, next, process, setNext, setOutBuffer, write
 
Methods inherited from class xsmeral.pipe.AbstractObjectProcessor
canStart, failStart, failStart, failStart, getContext, getInType, getOutType, getParams, getStatus, initContext, initContextSet, initialize, initializeInternal, postRun, preRun, requestStop, setContext, stop, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface xsmeral.pipe.interfaces.ObjectSource
getOutType
 

Field Detail

DEF_BOOTSTRAP_FILE

public static final String DEF_BOOTSTRAP_FILE
See Also:
Constant Field Values

BOOTSTRAP_OLD_SUFFIX

public static final String BOOTSTRAP_OLD_SUFFIX
See Also:
Constant Field Values

CONNECTION_RETRIES

public static final int CONNECTION_RETRIES
See Also:
Constant Field Values
Constructor Detail

HTMLCrawler

public HTMLCrawler()

HTMLCrawler

public HTMLCrawler(CrawlerConfiguration crawlerConf)
Initializes the state using the supplied configuration

Parameters:
crawlerConf - Crawler configuration
Method Detail

ignoresPolicy

public boolean ignoresPolicy()
See Also:
CrawlerConfiguration.isPolicyIgnored()

setIgnoresPolicy

public void setIgnoresPolicy(boolean ignoresPolicy)
See Also:
CrawlerConfiguration.setPolicyIgnored(boolean)

setGlobalCrawlDelayMinimum

public void setGlobalCrawlDelayMinimum(int globalCrawlDelayMinimum)
See Also:
CrawlerConfiguration.setGlobalCrawlDelayMinimum(int)

getGlobalCrawlDelayMinimum

public int getGlobalCrawlDelayMinimum()
See Also:
CrawlerConfiguration.getGlobalCrawlDelayMinimum()

isFakeReferrer

public boolean isFakeReferrer()
See Also:
CrawlerConfiguration.isFakeReferrer()

setFakeReferrer

public void setFakeReferrer(boolean fakeReferrer)
See Also:
CrawlerConfiguration.setFakeReferrer(boolean)

bootstrapURLs

public int bootstrapURLs(Collection<String> urls)
Adds valid (matched by a defined pattern) URLs from the supplied collection to the DB.

Parameters:
urls - The list of URLs to add
Returns:
Number of valid URLs added to DB
See Also:
EntityDescriptor, HostDescriptor

bootstrapFromFile

public void bootstrapFromFile(File bootFile)
Reads a file containing a list of URLs and calls bootstrapURLs(Collection).


initWithContext

protected void initWithContext()
Deserializes crawler configuration from XML and initializes crawler state

Overrides:
initWithContext in class AbstractObjectProcessor

initPostContext

protected void initPostContext()
Overrides:
initPostContext in class AbstractObjectProcessor

run

public void run()
Starts the crawling threads, waits for all to die, then stops. Looks for a bootstrap file which should contain a list of URLs to bootstrap the crawler. Once the list is read, it is renamed, so that the same URLs won't be read again. Also, before starting, checks for locked URLs (from previous run) and unlocks them.

Specified by:
run in interface Runnable
Specified by:
run in interface ObjectProcessor
Overrides:
run in class AbstractObjectProcessor