This section documents the fundamentals of SemNet and walks through the steps of creating a custom configuration. SemNet is also documented in the Javadoc and in the thesis. For usage instructions, see Usage.

SemNet is the name for a group of processors built on the PipedObjectProcessor (POP), the underlying data processing framework. Together, these processors carry out SemNet's purpose: crawling the web (HTMLCrawler), extracting information (scrapers), optionally mapping terms between vocabularies (StatementMapper) and persisting the information (SesameWriter).

Piped object processor

The main reason for using a framework like POP is the flexibility it provides. The concept of a processing chain makes it possible to adjust the way data is acquired, processed and persisted without affecting the other processing steps.

The main entry point for processing is the JobRunner class, which executes processing jobs. A processing job is a configuration of processors.

Note that all paths on this page are relative to the sample directory of the archive semnet_artnet_conf.zip, and all configuration files from sample/artnet can be seen on the ArtNet: About page.

The following snippet of code shows a relevant part of a processing job definition (artnet/job.xml):

<processorChain>
  <processor conf="crawler.xml" bootstrap="bootstrap.list">xsmeral.semnet.crawler.HTMLCrawler</processor>
  <processor>xsmeral.semnet.scraper.ScraperWrapper</processor>
  <processor mapping="wn_map.xml">xsmeral.semnet.mapper.StatementMapper</processor>
  <processor conf="sesame.properties" bootstrap="wn_as_class_hierarchy.rdf,wordnet-hyponym.rdf">xsmeral.semnet.sink.SesameWriter</processor>
</processorChain>

The order of processors determines the flow of objects through the chain – they flow from top to bottom. Processors are selected by their fully qualified class name, and the corresponding class file must be on the classpath at runtime. Processors can take parameters (which act like constructor arguments); these are documented in the Javadoc of the respective processor, in the section "Initialization parameters". An example is the HTMLCrawler, which takes the conf and bootstrap parameters.

Crawler

The first processor in ArtNet's job is the HTMLCrawler. It is a web robot that crawls hosts according to predefined rules. In the crawler's data model, these rules are represented by the CrawlerConfiguration class. An example of the configuration is the artnet/crawler.xml file. It defines database connection parameters (for the crawler's URL frontier), several optional crawling parameters and a collection of host definitions.

A host definition is represented by the HostDescriptor class. This is an example definition of one host from the crawler configuration:

<host>
  <baseURL>http://www.csfd.cz/</baseURL>
  <name>CSFD.cz</name>
  <charset>UTF-8</charset>
  <crawlDelay>1500</crawlDelay>
  <sourceFirst>true</sourceFirst>
  <source>
    <pattern update="1">/kino/?</pattern>
    <pattern update="1">/tvurci/?</pattern>
    <pattern update="60">/tvurci/.+</pattern>
  </source>
  <entities>
    <entity weight="3">
      <pattern update="365">/film/[^/]+/?</pattern>
      <scraper>xsmeral.artnet.scraper.CSFDScraper$Film</scraper>
    </entity>
    <entity weight="2">
      <pattern update="365">/tvurce/[^/]+/?</pattern>
      <scraper>xsmeral.artnet.scraper.CSFDScraper$Creator</scraper>
    </entity>
  </entities>
</host>

The first part contains host-specific settings: the base URL, a name, the character encoding used on the host's pages, the minimum delay (in milliseconds) between two requests to the host, and an option to visit source URLs first in every crawling session.

The operation of the crawler can be simplified as follows (a schematic sketch in code follows the list):

  1. Get a URL (from the frontier or the bootstrap list) and download the page.
  2. Search for links and store those matching any of the defined patterns.
  3. If the URL is an entity URL, pass the page to the scrapers.
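
In code, the loop might look roughly like the following sketch. All type and method names here (Frontier, Page, fetch and so on) are hypothetical illustrations, not the actual SemNet API; the real logic lives in HTMLCrawler, and the frontier is backed by the database configured in crawler.xml.

import java.net.URL;
import java.util.List;

// Schematic sketch of the crawl loop described above; all names are
// hypothetical, the real implementation is xsmeral.semnet.crawler.HTMLCrawler.
abstract class CrawlLoopSketch {

    interface Frontier {             // URL frontier, backed by a database in SemNet
        boolean hasNext();
        URL next();
        void add(URL url);
    }

    interface Page {
        List<URL> extractLinks();
    }

    abstract Page fetch(URL url);                // download the page
    abstract boolean matchesAnyPattern(URL url); // matches a source or entity pattern?
    abstract boolean isEntityUrl(URL url);       // matches an <entity> pattern?
    abstract void passToScrapers(Page page);     // hand the page over for scraping

    void crawl(Frontier frontier) {
        while (frontier.hasNext()) {
            URL url = frontier.next();           // 1. get a URL and download the page
            Page page = fetch(url);
            for (URL link : page.extractLinks()) {
                if (matchesAnyPattern(link)) {   // 2. store links matching a defined pattern
                    frontier.add(link);
                }
            }
            if (isEntityUrl(url)) {              // 3. entity pages go to the scrapers
                passToScrapers(page);
            }
        }
    }
}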

The next part contains URL patterns of source pages – pages that are not entity pages themselves (and will not be scraped) but contain links to entity pages.

<source>
  <pattern update="1">/kino/?</pattern>
  <pattern update="1">/tvurci/?</pattern>
  <pattern update="60">/tvurci/.+</pattern>
</source>

The patterns are regular expressions matched against URLs relative to the base URL. The last part of a host definition in the crawler configuration contains the definitions of entity pages.

<entity weight="3">
  <pattern update="365">/film/[^/]+/?</pattern>
  <scraper>xsmeral.artnet.scraper.CSFDScraper$Film</scraper>
</entity>

Each entity has a weight option, which defines the importance of the entity relative to other entities and affects the order and frequency of visits during a crawling session. For example, if the crawler is configured with two entities – A with a weight of 2 and B with a weight of 3 – and visits 100 pages in a crawling session, it visits (and scrapes) 40 pages of A and 60 pages of B.
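
Assuming a strictly proportional allocation, which the example above implies (the crawler's actual scheduling may differ in detail), the arithmetic is simply:

// Proportional allocation of a 100-page crawling session between two
// entities with weights 2 and 3 (assumption: strictly proportional split).
public class WeightShareExample {
    public static void main(String[] args) {
        int pagesPerSession = 100;
        int weightA = 2, weightB = 3;
        int totalWeight = weightA + weightB;  // 5
        System.out.println("A: " + pagesPerSession * weightA / totalWeight + " pages"); // A: 40 pages
        System.out.println("B: " + pagesPerSession * weightB / totalWeight + " pages"); // B: 60 pages
    }
}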

The pattern is a regular expression matching a URL, relative to the base URL. The update parameter specifies how often the page should be revisited and scraped again (the value is in days). The scraper node specifies the scraper for the entity; multiple scraper nodes can be specified.

Scrapers

Scrapers are an inseparable part of the crawler, yet they exist as a separate module, in the interest of modularity and lower complexity of individual components. They perform the core operation in the process of building a semantic network – transforming structure from HTML code into RDF statements. The scrapers in SemNet are implementations of the class AbstractScraper: they receive an EntityDocument and output Statements, and AbstractScraper offers all the necessary methods.
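
As a rough conceptual sketch: EntityDocument and Statement are real SemNet types, but every member below is a hypothetical placeholder – consult the Javadoc of AbstractScraper for the actual contract and helper methods.

// Conceptual sketch of a scraper; all member names are assumptions,
// not the documented API of xsmeral.semnet.scraper.AbstractScraper.
abstract class ScraperSketch {

    interface EntityDocument {       // stand-in for the real SemNet type
        String getUrl();
    }

    abstract String select(EntityDocument doc, String query);            // pull a value out of the HTML
    abstract void emit(String subject, String predicate, String object); // output an RDF statement

    // Transform the structure of an entity page into RDF statements.
    void scrape(EntityDocument doc) {
        String title = select(doc, "h1.film-title"); // hypothetical selector
        emit(doc.getUrl(), "dc:title", title);
    }
}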

Since each entity may have multiple scrapers defined, each host may have multiple entities, and multiple hosts may be defined in the configuration, the class ScraperWrapper is used as a "router", dispatching EntityDocuments to scrapers based on URL patterns.
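
The routing idea can be sketched as follows (the names and structure are illustrative assumptions, not the ScraperWrapper API):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative sketch of pattern-based dispatch as described above;
// this is not the actual API of xsmeral.semnet.scraper.ScraperWrapper.
abstract class RouterSketch<Document, Scraper> {

    private final Map<Pattern, Scraper> routes = new LinkedHashMap<>();

    void register(String urlPattern, Scraper scraper) {
        routes.put(Pattern.compile(urlPattern), scraper);
    }

    abstract String urlOf(Document doc);
    abstract void dispatch(Document doc, Scraper scraper);

    // Hand the document to every scraper whose pattern matches its URL;
    // an entity may have several scrapers registered.
    void route(Document doc) {
        for (Map.Entry<Pattern, Scraper> entry : routes.entrySet()) {
            if (entry.getKey().matcher(urlOf(doc)).find()) {
                dispatch(doc, entry.getValue());
            }
        }
    }
}

With the host definition above, for instance, the pattern /film/[^/]+/? would be registered with CSFDScraper$Film, so film pages are dispatched to that scraper.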

Examples of scrapers can be seen on the ArtNet: About page.
