xsmeral.semnet.scraper
Class AbstractScraper

java.lang.Object
  extended by xsmeral.pipe.AbstractObjectProcessor
      extended by xsmeral.pipe.LocalObjectFilter<EntityDocument,Statement>
          extended by xsmeral.semnet.scraper.AbstractScraper
All Implemented Interfaces:
Runnable, xsmeral.pipe.context.ContextAware, xsmeral.pipe.interfaces.ObjectProcessor, xsmeral.pipe.interfaces.ObjectSink<EntityDocument>, xsmeral.pipe.interfaces.ObjectSource<Statement>

@ObjectProcessorInterface(in=EntityDocument.class,
                          out=org.openrdf.model.Statement.class)
public abstract class AbstractScraper
extends xsmeral.pipe.LocalObjectFilter<EntityDocument,Statement>

A scraper works in cooperation with a crawler, extracting data from web pages. This implementation uses Sesame Statements to represent facts and XPath to extract data.
Subclasses must implement scrape(EntityDocument), which should scrape one EntityDocument, and getNamespace(). If several entity types are processed in one processing chain, ScraperWrapper can be used to route EntityDocuments to the correct scrapers, based on their class (and configuration). Convenience methods such as fact, uri and lit are provided. XPathUtil can also be used; it provides simple methods for querying the DOM tree.

See Also:
Stats, ScraperWrapper
Initialization parameters
stats - (optional) Name of the stats group for this scraper. Defaults to the simple class name.

Nested Class Summary
 
Nested classes/interfaces inherited from interface xsmeral.pipe.interfaces.ObjectProcessor
xsmeral.pipe.interfaces.ObjectProcessor.Status
 
Field Summary
protected  EntityDocument doc
           
protected static ValueFactory f
           
 
Fields inherited from class xsmeral.pipe.AbstractObjectProcessor
canStart, context, status
 
Constructor Summary
AbstractScraper()
           
 
Method Summary
protected  URI current()
          Returns the currently processed URL as an instance of URI.
protected  void fact(Resource sub, URI pred, Value obj)
          Writes a statement composed of the given subject, predicate and object to the output.
protected  void fact(URI pred, Value obj)
          Writes a statement composed of current() as the subject and the given predicate and object to the output.
abstract  String getNamespace()
          Returns the namespace used by this scraper.
protected static ValueFactory getValueFactory()
          Returns a ValueFactory (instantiated at initialization).
 Map<URI,String> getVocabulary()
          Returns a map of all terms and their definitions in this scraper's vocabulary (fields of type URI annotated with Term).
protected  void initPostContext()
          Initializes the stats.
protected  Value lit(String literal)
          Returns a Sesame Literal for the specified string.
protected  void preRun()
          Outputs a statement (uri, RDF.TYPE, RDFS.CLASS) for each field of type URI annotated with EntityClass.
protected  void process()
          Calls scrape(EntityDocument) and catches any Exception, logging it as a parsing error.
protected abstract  void scrape(EntityDocument doc)
          Scrapes one document and outputs any number of facts.
protected  URI uri(String uri)
          Returns the given URI, normalized and resolved against the current base URL if it is relative.
 
Methods inherited from class xsmeral.pipe.LocalObjectFilter
getNext, getOutBuffer, getPrev, handleStoppedSink, handleStoppedSource, next, prev, read, requestStop, setNext, setOutBuffer, write
 
Methods inherited from class xsmeral.pipe.AbstractObjectProcessor
canStart, failStart, failStart, failStart, getContext, getInType, getOutType, getParams, getStatus, initContext, initContextSet, initialize, initializeInternal, initWithContext, postRun, run, setContext, stop, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface xsmeral.pipe.interfaces.ObjectSink
getInType
 
Methods inherited from interface xsmeral.pipe.interfaces.ObjectSource
getOutType
 

Field Detail

f

protected static final ValueFactory f

doc

protected EntityDocument doc
Constructor Detail

AbstractScraper

public AbstractScraper()
Method Detail

getValueFactory

protected static ValueFactory getValueFactory()
Returns a ValueFactory (instantiated at initialization).


fact

protected void fact(Resource sub,
                    URI pred,
                    Value obj)
             throws xsmeral.pipe.ProcessorStoppedException
Writes a statement composed of the given subject, predicate and object to the output.

Parameters:
sub - The subject
pred - The predicate
obj - The object
Throws:
xsmeral.pipe.ProcessorStoppedException

fact

protected void fact(URI pred,
                    Value obj)
             throws xsmeral.pipe.ProcessorStoppedException
Writes a statement composed of current() as the subject and the given predicate and object to the output.

Throws:
xsmeral.pipe.ProcessorStoppedException
See Also:
fact(org.openrdf.model.Resource, org.openrdf.model.URI, org.openrdf.model.Value)
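
The relationship between the two fact overloads can be illustrated with a minimal sketch. It is not the library's implementation: FactSketch, Triple, startDocument and output are hypothetical stand-ins, and plain strings replace the Sesame Resource, URI and Value types. It assumes only what is stated above, namely that the two-argument form uses the current document's URL as the subject.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not part of xsmeral.semnet: models only how the
// two-argument fact() could delegate to the three-argument one.
class FactSketch {

    // Stand-in for a Sesame Statement: three plain strings.
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return "(" + subject + ", " + predicate + ", " + object + ")";
        }
    }

    private final List<Triple> output = new ArrayList<Triple>();
    private String current; // URL of the document being processed (assumption)

    void startDocument(String url) {
        current = url;
    }

    // Full form: the caller names the subject explicitly.
    void fact(String sub, String pred, String obj) {
        output.add(new Triple(sub, pred, obj));
    }

    // Short form: the current document URL is used as the subject.
    void fact(String pred, String obj) {
        fact(current, pred, obj);
    }

    List<Triple> output() {
        return output;
    }

    public static void main(String[] args) {
        FactSketch sketch = new FactSketch();
        sketch.startDocument("http://example.org/film/1");
        sketch.fact("ex:title", "Alien");
        sketch.fact("http://example.org/person/9", "ex:name", "Ridley Scott");
        for (Triple t : sketch.output()) {
            System.out.println(t);
        }
    }
}
```

The short form is convenient when most facts on a page describe the page's own entity; the full form covers facts about other resources mentioned on the same page.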

lit

protected Value lit(String literal)
Returns a Sesame Literal for the specified string. Also performs decoding of HTML special entities.

Parameters:
literal - The literal string
Returns:
Literal value
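
To make the entity decoding step concrete, here is a minimal sketch of what such decoding might look like. It is an assumption, not the code behind lit: EntityDecodeSketch and decode are hypothetical names, only the five basic named entities and numeric references are handled, and the result is a plain String instead of a Sesame Literal.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of AbstractScraper: decodes a small set of
// HTML entities, as might be done before wrapping a string in a literal.
class EntityDecodeSketch {

    // Only the five basic named entities are covered in this sketch.
    private static final Map<String, String> NAMED = new HashMap<String, String>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
    }

    // Matches &name;, &#123; and &#x1F; style references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z]+);");

    // Converts the digits of a numeric reference, or returns the fallback
    // when the number is out of range or not a valid code point.
    private static String fromCodePoint(String digits, int radix, String fallback) {
        try {
            return new String(Character.toChars(Integer.parseInt(digits, radix)));
        } catch (IllegalArgumentException e) {
            return fallback;
        }
    }

    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = fromCodePoint(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = fromCodePoint(body.substring(1), 10, m.group());
            } else {
                String known = NAMED.get(body);
                // Unknown names are left untouched.
                replacement = known != null ? known : m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Tom &amp; Jerry &lt;3 &#x41;"));
    }
}
```

The text is decoded in a single pass, so an already escaped sequence such as &amp;lt; becomes &lt; and is not decoded a second time.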

uri

protected URI uri(String uri)
           throws URISyntaxException,
                  MalformedURLException
Returns the given URI, normalized and resolved against the current base URL if it is relative.

Parameters:
uri - The URI to normalize
Returns:
Normalized and resolved URI
Throws:
MalformedURLException - If the given string does not contain a valid URL
URISyntaxException - If the given string cannot be parsed as a URI
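
Resolving and normalizing a reference against a base URL can be sketched with java.net.URI alone. This is an illustration under assumptions, not the method's actual code: UriResolveSketch and resolve are hypothetical names, the base URL is passed in explicitly instead of being taken from the current document, and the result is a java.net.URI, not a Sesame URI.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of AbstractScraper: resolves a possibly
// relative reference against a base URL and normalizes the result.
class UriResolveSketch {

    static URI resolve(String base, String reference) throws URISyntaxException {
        URI baseUri = new URI(base);
        URI refUri = new URI(reference.trim());
        // resolve() returns refUri unchanged when it is already absolute;
        // normalize() then removes "." and ".." path segments.
        return baseUri.resolve(refUri).normalize();
    }

    public static void main(String[] args) throws URISyntaxException {
        String base = "http://example.org/films/2010/index.html";
        System.out.println(resolve(base, "../actors/42"));
        System.out.println(resolve(base, "http://other.example/x/./y"));
    }
}
```

Unlike the documented method, this sketch declares only URISyntaxException, because it never converts the input to a java.net.URL.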

current

protected URI current()
Returns the currently processed URL as an instance of URI.


getVocabulary

public Map<URI,String> getVocabulary()
Returns a map of all terms and their definitions in this scraper's vocabulary (fields of type URI annotated with Term).
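
Collecting annotated fields into a map is typically done with reflection. The sketch below shows one way this could work; it is not the library's code. The Term annotation defined here is a hypothetical stand-in (the attributes of the real annotation are not documented on this page), java.net.URI replaces the Sesame URI type, and VocabularySketch, FilmVocabulary and collect are illustrative names.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

class VocabularySketch {

    // Hypothetical stand-in for the Term annotation; the real one may differ.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Term {
        String value();
    }

    // Example holder with two vocabulary terms and one unrelated field.
    static class FilmVocabulary {
        @Term("Title of the film")
        final URI title = URI.create("http://example.org/film#title");

        @Term("Release year")
        final URI year = URI.create("http://example.org/film#year");

        // Not annotated, so it is not part of the vocabulary.
        final URI internal = URI.create("http://example.org/film#internal");
    }

    // Maps the value of every URI field annotated with Term to its definition.
    static Map<URI, String> collect(Object holder) {
        Map<URI, String> vocabulary = new LinkedHashMap<URI, String>();
        for (Field field : holder.getClass().getDeclaredFields()) {
            Term term = field.getAnnotation(Term.class);
            if (term == null || !URI.class.isAssignableFrom(field.getType())) {
                continue;
            }
            try {
                field.setAccessible(true);
                vocabulary.put((URI) field.get(holder), term.value());
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read field " + field.getName(), e);
            }
        }
        return vocabulary;
    }

    public static void main(String[] args) {
        for (Map.Entry<URI, String> e : collect(new FilmVocabulary()).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

The annotation needs runtime retention for getAnnotation to see it. The order of getDeclaredFields is not guaranteed, so callers should not rely on the map's iteration order.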


preRun

protected void preRun()
               throws xsmeral.pipe.ProcessorStoppedException
Outputs a statement (uri, RDF.TYPE, RDFS.CLASS) for each field of type URI annotated with EntityClass.

Overrides:
preRun in class xsmeral.pipe.AbstractObjectProcessor
Throws:
xsmeral.pipe.ProcessorStoppedException

initPostContext

protected void initPostContext()
Initializes the stats.

Overrides:
initPostContext in class xsmeral.pipe.AbstractObjectProcessor
See Also:
Stats

process

protected void process()
                throws xsmeral.pipe.ProcessorStoppedException
Calls scrape(EntityDocument) and catches any Exception, logging it as a parsing error.

Overrides:
process in class xsmeral.pipe.LocalObjectFilter<EntityDocument,Statement>
Throws:
xsmeral.pipe.ProcessorStoppedException
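
The division of labour between process() and scrape(EntityDocument) follows the template method pattern, which the following sketch models in isolation. It is an assumption-laden illustration, not the library's code: MiniScraper, NumberScraper and the facts and errors lists are hypothetical, documents are plain strings, errors are collected in a list instead of being logged, and the handling of ProcessorStoppedException is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

class ProcessSketch {

    // Hypothetical stand-in for the abstract scraper; only the
    // error-handling shape of process() is modeled.
    static abstract class MiniScraper {
        final List<String> facts = new ArrayList<String>();
        final List<String> errors = new ArrayList<String>();

        // Counterpart of scrape(EntityDocument): may throw any exception.
        protected abstract void scrape(String doc) throws Exception;

        // Counterpart of process(): a failing document is recorded as a
        // parsing error and does not stop the run.
        final void process(String doc) {
            try {
                scrape(doc);
            } catch (Exception e) {
                errors.add("Parsing error in '" + doc + "': " + e.getMessage());
            }
        }
    }

    // Example subclass: treats each document as a number to extract.
    static class NumberScraper extends MiniScraper {
        @Override
        protected void scrape(String doc) {
            facts.add("value=" + Integer.parseInt(doc.trim()));
        }
    }

    public static void main(String[] args) {
        NumberScraper scraper = new NumberScraper();
        scraper.process("12");
        scraper.process("abc"); // fails, recorded as a parsing error
        scraper.process(" 7 ");
        System.out.println(scraper.facts);
        System.out.println(scraper.errors.size() + " parsing error(s)");
    }
}
```

With this shape, a subclass only has to describe how one document is scraped, and one malformed page cannot abort the whole processing chain.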

getNamespace

public abstract String getNamespace()
Returns the namespace used by this scraper.


scrape

protected abstract void scrape(EntityDocument doc)
                        throws Exception
Scrapes one document and outputs any number of facts.

Parameters:
doc - The document to scrape
Throws:
Exception - Can throw any exception