xsmeral.semnet.crawler
Class HTMLCrawler

java.lang.Object
  extended by xsmeral.pipe.AbstractObjectProcessor
      extended by xsmeral.pipe.LocalObjectSource<EntityDocument>
          extended by xsmeral.semnet.crawler.HTMLCrawler
All Implemented Interfaces:
Runnable, ContextAware, ObjectProcessor, ObjectSource<EntityDocument>

@ObjectProcessorInterface(out=EntityDocument.class)
public class HTMLCrawler
extends LocalObjectSource<EntityDocument>

A web crawler of HTML pages. Crawls configured hosts, looking for links matching specified patterns.
Documents at the matched URLs are wrapped in EntityDocuments and passed to a scraper (AbstractScraper). Scrapers work in co-operation with the crawler, using the same configuration. Persistent state is maintained using URLManager and HostManager, enabling the crawler to be stopped and restarted at any time.

Configuration

The crawler is configured with a CrawlerConfiguration, which contains HostDescriptors describing the crawling targets. Crawling focuses on entities represented by URLs of the host; entities are described by EntityDescriptors.
The database configuration for state persistence, internally represented by an RDBLayer, is stored in the crawler configuration as well. The configuration is stored as an XML file; an external library (XStream) is used for XML (de)serialization.
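As a rough illustration, a serialized configuration might look like the sketch below. All element names are hypothetical: the actual XML layout depends entirely on the XStream aliases configured for CrawlerConfiguration, HostDescriptor and EntityDescriptor.

```xml
<!-- Illustrative sketch only; element names depend on configured XStream aliases -->
<crawlerConfiguration>
  <hosts>
    <hostDescriptor>
      <baseUrl>http://example.org/</baseUrl>
      <entities>
        <entityDescriptor>
          <urlPattern>/movie/\d+</urlPattern>
        </entityDescriptor>
      </entities>
    </hostDescriptor>
  </hosts>
  <dbLayer><!-- RDBLayer settings --></dbLayer>
</crawlerConfiguration>
```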

Bootstrapping

Certain starting points (URLs) need to be specified to seed the crawler. These are supplied as a list of absolute URLs contained in a single file, one URL per line. This file should either be placed in the same directory as the crawler configuration and named with the value of DEF_BOOTSTRAP_FILE, or another file name should be supplied in the bootstrap initialization parameter.
All URLs in the file need to have their corresponding hosts already defined in the configuration, otherwise they are ignored.
After URLs from the file have been successfully read and added to the database, the file is renamed.
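The host-filtering rule above can be sketched with the standard library alone. This is not the crawler's implementation, just an illustration of the documented behavior: URLs whose host is not among the configured hosts are dropped, as are malformed lines.

```java
import java.net.URI;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch: keep only bootstrap URLs whose host is configured,
// mirroring the documented rule that URLs with unknown hosts are ignored.
public class BootstrapSketch {
    public static List<String> filterByHost(List<String> urls, Set<String> configuredHosts) {
        return urls.stream()
                .filter(u -> {
                    try {
                        String host = URI.create(u).getHost();
                        return host != null && configuredHosts.contains(host);
                    } catch (IllegalArgumentException e) {
                        return false; // malformed URL line: ignore
                    }
                })
                .collect(Collectors.toList());
    }
}
```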

Crawling

One instance of the crawler crawls multiple hosts at the same time, with one or more threads per host. An implementation of the Robots Exclusion standard is provided in the class RobotsPolicy, which allows the crawler to obey the crawling rules defined by the target host (allowed and disallowed URL patterns and crawl delay). Adherence to the rules is optional.
Retrieved web pages are decoded using the character encoding determined (or guessed) by the class CharsetDetector and parsed using the third-party library HtmlCleaner, which compensates for the multitude of web pages that are not valid HTML. Another compensatory measure is URL normalization, provided by URLUtil.normalize(URL), which ensures a consistent representation of URLs.
Consistent HTTP connection settings are provided by auxiliary class ConnectionManager.
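To give an idea of what URL normalization typically involves, the sketch below lowercases the scheme and host, drops the default port, and ensures a non-empty path. It uses only the standard library; the actual rules applied by URLUtil.normalize(URL) may differ.

```java
import java.net.URL;
import java.util.Locale;

// Illustrative sketch of URL normalization; not the actual URLUtil.normalize(URL).
public class NormalizeSketch {
    public static String normalize(URL url) {
        String protocol = url.getProtocol().toLowerCase(Locale.ROOT);
        String host = url.getHost().toLowerCase(Locale.ROOT);
        int port = url.getPort();
        // Drop the port when it equals the protocol default (e.g. 80 for http)
        String portPart = (port == -1 || port == url.getDefaultPort()) ? "" : ":" + port;
        String path = url.getPath().isEmpty() ? "/" : url.getPath();
        return protocol + "://" + host + portPart + path;
    }
}
```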


Initialization parameters
conf - Crawler configuration file name
bootstrap - (Optional) Name of a file containing a list of URLs (one per line) to load to the database prior to running

Context parameters supplied
hostManager - A HostManager instance initialized with hosts from the crawler configuration.

Nested Class Summary
 
Nested classes/interfaces inherited from interface xsmeral.pipe.interfaces.ObjectProcessor
ObjectProcessor.Status
 
Field Summary
static String BOOTSTRAP_OLD_SUFFIX
           
static int CONNECTION_RETRIES
           
static String DEF_BOOTSTRAP_FILE
           
 
Fields inherited from class xsmeral.pipe.LocalObjectSource
next, outBuffer, outBufferCapacity
 
Fields inherited from class xsmeral.pipe.AbstractObjectProcessor
canStart, context, status
 
Constructor Summary
HTMLCrawler()
           
HTMLCrawler(CrawlerConfiguration crawlerConf)
          Initializes the state using the supplied configuration
 
Method Summary
 void bootstrapFromFile(File bootFile)
          Reads a file containing a list of URLs and calls bootstrapURLs.
 int bootstrapURLs(Collection<String> urls)
          Adds valid (matched by a defined pattern) URLs from the supplied collection to the DB.
 int getGlobalCrawlDelayMinimum()
           
 boolean ignoresPolicy()
           
protected  void initPostContext()
           
protected  void initWithContext()
          Deserializes crawler configuration from XML and initializes crawler state
 boolean isFakeReferrer()
           
 void run()
          Starts the crawling threads, waits for all to die, then stops.
 void setFakeReferrer(boolean fakeReferrer)
           
 void setGlobalCrawlDelayMinimum(int globalCrawlDelayMinimum)
           
 void setIgnoresPolicy(boolean ignoresPolicy)
           
 
Methods inherited from class xsmeral.pipe.LocalObjectSource
getNext, getOutBuffer, handleStoppedSink, next, process, setNext, setOutBuffer, write
 
Methods inherited from class xsmeral.pipe.AbstractObjectProcessor
canStart, failStart, failStart, failStart, getContext, getInType, getOutType, getParams, getStatus, initContext, initContextSet, initialize, initializeInternal, postRun, preRun, requestStop, setContext, stop, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface xsmeral.pipe.interfaces.ObjectSource
getOutType
 

Field Detail

DEF_BOOTSTRAP_FILE

public static final String DEF_BOOTSTRAP_FILE
See Also:
Constant Field Values

BOOTSTRAP_OLD_SUFFIX

public static final String BOOTSTRAP_OLD_SUFFIX
See Also:
Constant Field Values

CONNECTION_RETRIES

public static final int CONNECTION_RETRIES
See Also:
Constant Field Values
Constructor Detail

HTMLCrawler

public HTMLCrawler()

HTMLCrawler

public HTMLCrawler(CrawlerConfiguration crawlerConf)
Initializes the state using the supplied configuration

Parameters:
crawlerConf - Crawler configuration
Method Detail

ignoresPolicy

public boolean ignoresPolicy()
See Also:
CrawlerConfiguration.isPolicyIgnored()

setIgnoresPolicy

public void setIgnoresPolicy(boolean ignoresPolicy)
See Also:
CrawlerConfiguration.setPolicyIgnored(boolean)

setGlobalCrawlDelayMinimum

public void setGlobalCrawlDelayMinimum(int globalCrawlDelayMinimum)
See Also:
CrawlerConfiguration.setGlobalCrawlDelayMinimum(int)

getGlobalCrawlDelayMinimum

public int getGlobalCrawlDelayMinimum()
See Also:
CrawlerConfiguration.getGlobalCrawlDelayMinimum()

isFakeReferrer

public boolean isFakeReferrer()
See Also:
CrawlerConfiguration.isFakeReferrer()

setFakeReferrer

public void setFakeReferrer(boolean fakeReferrer)
See Also:
CrawlerConfiguration.setFakeReferrer(boolean)

bootstrapURLs

public int bootstrapURLs(Collection<String> urls)
Adds valid (matched by a defined pattern) URLs from the supplied collection to the DB.

Parameters:
urls - The list of URLs to add
Returns:
Number of valid URLs added to DB
See Also:
EntityDescriptor, HostDescriptor

bootstrapFromFile

public void bootstrapFromFile(File bootFile)
Reads a file containing a list of URLs and calls bootstrapURLs(Collection).


initWithContext

protected void initWithContext()
Deserializes crawler configuration from XML and initializes crawler state

Overrides:
initWithContext in class AbstractObjectProcessor

initPostContext

protected void initPostContext()
Overrides:
initPostContext in class AbstractObjectProcessor

run

public void run()
Starts the crawling threads, waits for all to die, then stops. Looks for a bootstrap file which should contain a list of URLs to bootstrap the crawler. Once the list is read, it is renamed, so that the same URLs won't be read again. Also, before starting, checks for locked URLs (from previous run) and unlocks them.

Specified by:
run in interface Runnable
Specified by:
run in interface ObjectProcessor
Overrides:
run in class AbstractObjectProcessor