java.lang.Object
xsmeral.pipe.AbstractObjectProcessor
xsmeral.pipe.LocalObjectFilter<EntityDocument,Statement>
xsmeral.semnet.scraper.AbstractScraper
@ObjectProcessorInterface(in=EntityDocument.class, out=org.openrdf.model.Statement.class)
public abstract class AbstractScraper
extends xsmeral.pipe.LocalObjectFilter<EntityDocument,Statement>
A scraper works in cooperation with a crawler, extracting data from web pages.
This implementation uses Sesame Statements to represent facts and XPath to extract data.
The main method that needs to be implemented is the scrape(EntityDocument)
method, which should scrape one EntityDocument; getNamespace() is also abstract
and must be implemented. If more than one entity type is processed in one
processing chain, the ScraperWrapper can be used to route EntityDocuments to
the correct scrapers, based on their class (and configuration).
Convenience methods like fact, uri or lit are provided. Also, the XPathUtil
can be used, providing simple methods for querying the DOM tree.
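The division of work described above (subclasses supply scrape and getNamespace, while the base class drives processing and offers fact helpers) can be sketched without the library. The following is only an illustration under stated assumptions: Doc, MiniScraper and TitleScraper are hypothetical stand-ins, facts are plain strings instead of Sesame Statements, and errors are collected in a list rather than logged. It is not the actual AbstractScraper implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class ScraperSketch {

    /** Hypothetical stand-in for EntityDocument: a URL and an optional title. */
    public static final class Doc {
        public final String url;
        public final String title;

        public Doc(String url, String title) {
            this.url = url;
            this.title = title;
        }
    }

    /** Hypothetical stand-in for AbstractScraper; facts are plain strings here. */
    public abstract static class MiniScraper {
        public final List<String> output = new ArrayList<>();
        public final List<String> parseErrors = new ArrayList<>();
        protected Doc doc;

        /** URL of the document currently being processed. */
        protected String current() {
            return doc.url;
        }

        /** Writes a subject, predicate, object triple to the output. */
        protected void fact(String sub, String pred, String obj) {
            output.add(sub + " " + pred + " " + obj);
        }

        /** Same as above, with current() as the subject. */
        protected void fact(String pred, String obj) {
            fact(current(), pred, obj);
        }

        /** Calls scrape and records any exception as a parsing error. */
        public void process(Doc next) {
            doc = next;
            try {
                scrape(next);
            } catch (Exception e) {
                parseErrors.add(next.url + ": " + e.getMessage());
            }
        }

        public abstract String getNamespace();

        protected abstract void scrape(Doc doc) throws Exception;
    }

    /** Example subclass: emits one title fact per document. */
    public static final class TitleScraper extends MiniScraper {
        @Override
        public String getNamespace() {
            return "http://example.org/vocab#";
        }

        @Override
        protected void scrape(Doc doc) throws Exception {
            if (doc.title == null) {
                throw new IllegalStateException("no title found");
            }
            fact(getNamespace() + "title", doc.title);
        }
    }

    public static void main(String[] args) {
        TitleScraper scraper = new TitleScraper();
        scraper.process(new Doc("http://example.org/film/1", "Metropolis"));
        scraper.process(new Doc("http://example.org/film/2", null));
        System.out.println(scraper.output);
        System.out.println(scraper.parseErrors);
    }
}
```

In this sketch a document that fails to scrape does not stop the chain: the exception ends up in parseErrors and the next document is processed normally, which mirrors the behaviour documented for process().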
See Also:
Stats, ScraperWrapper
stats - (optional) Name of stats group for this scraper. Default is simple class name.
Nested Class Summary |
---|
Nested classes/interfaces inherited from interface xsmeral.pipe.interfaces.ObjectProcessor |
---|
xsmeral.pipe.interfaces.ObjectProcessor.Status |
Field Summary | |
---|---|
protected EntityDocument | doc |
protected static ValueFactory | f |
Fields inherited from class xsmeral.pipe.AbstractObjectProcessor |
---|
canStart, context, status |
Constructor Summary | |
---|---|
AbstractScraper() | |
Method Summary | |
---|---|
protected URI | current(): Returns currently processed URL as an instance of URI. |
protected void | fact(Resource sub, URI pred, Value obj): Writes a statement composed of given subject, predicate and object to the output. |
protected void | fact(URI pred, Value obj): Writes a statement composed of current() as subject and given predicate and object to the output. |
abstract String | getNamespace(): Returns namespace used by this scraper. |
protected static ValueFactory | getValueFactory(): Returns a ValueFactory (instantiated at initialization). |
Map<URI,String> | getVocabulary(): Returns map of all terms and their definitions in this scraper's vocabulary (fields of type URI annotated with Term). |
protected void | initPostContext(): Initializes the stats. |
protected Value | lit(String literal): Returns Sesame Literal for the specified string. |
protected void | preRun(): Outputs a statement (uri, RDF.TYPE, RDFS.CLASS) for each field of type URI annotated with EntityClass. |
protected void | process(): Calls scrape(EntityDocument) and catches any Exception, logging it as a parsing error. |
protected abstract void | scrape(EntityDocument doc): Scrapes one document and outputs any number of facts. |
protected URI | uri(String uri): Returns given URI normalized and resolved against current base URL if relative. |
Methods inherited from class xsmeral.pipe.LocalObjectFilter |
---|
getNext, getOutBuffer, getPrev, handleStoppedSink, handleStoppedSource, next, prev, read, requestStop, setNext, setOutBuffer, write |
Methods inherited from class xsmeral.pipe.AbstractObjectProcessor |
---|
canStart, failStart, failStart, failStart, getContext, getInType, getOutType, getParams, getStatus, initContext, initContextSet, initialize, initializeInternal, initWithContext, postRun, run, setContext, stop, toString |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Methods inherited from interface xsmeral.pipe.interfaces.ObjectSink |
---|
getInType |
Methods inherited from interface xsmeral.pipe.interfaces.ObjectSource |
---|
getOutType |
Field Detail |
---|
protected static final ValueFactory f
protected EntityDocument doc
Constructor Detail |
---|
public AbstractScraper()
Method Detail |
---|
protected static ValueFactory getValueFactory()
Returns a ValueFactory (instantiated at initialization).
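protected void fact(Resource sub, URI pred, Value obj) throws xsmeral.pipe.ProcessorStoppedException
Writes a statement composed of given subject, predicate and object to the output.
Parameters:
sub - The subject
pred - The predicate
obj - The object
Throws:
xsmeral.pipe.ProcessorStoppedException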
protected void fact(URI pred, Value obj) throws xsmeral.pipe.ProcessorStoppedException
Writes a statement composed of current() as subject and given predicate and object to the output.
Throws:
xsmeral.pipe.ProcessorStoppedException
See Also:
fact(org.openrdf.model.Resource, org.openrdf.model.URI, org.openrdf.model.Value)
protected Value lit(String literal)
Returns Sesame Literal for the specified string.
Parameters:
literal - The literal string
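protected URI uri(String uri) throws URISyntaxException, MalformedURLException
Returns given URI normalized and resolved against current base URL if relative.
Parameters:
uri - The URI to normalize
Throws:
MalformedURLException - If the given string does not contain a valid URL
URISyntaxException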
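What "normalized and resolved against the current base URL" can mean in practice is sketched below with the standard java.net.URI class. This is an assumption-laden illustration, not the library's code: the real method returns a Sesame URI, the helper name resolve and the example URLs are hypothetical, and URI.create is used so that invalid input raises an unchecked IllegalArgumentException instead of the checked exceptions declared above.

```java
import java.net.URI;

public class UriResolveSketch {

    /**
     * Hypothetical helper: resolves ref against base (a no-op for absolute
     * references) and then removes "." and ".." path segments.
     */
    public static URI resolve(String base, String ref) {
        return URI.create(base).resolve(URI.create(ref)).normalize();
    }

    public static void main(String[] args) {
        String base = "http://example.org/films/list.html";
        // Relative reference: resolved against the base, then normalized.
        System.out.println(resolve(base, "../people/./42"));
        // Absolute reference: the base is ignored, only normalization applies.
        System.out.println(resolve(base, "http://other.example/x/../y"));
    }
}
```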
protected URI current()
Returns currently processed URL as an instance of URI.
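public Map<URI,String> getVocabulary()
Returns map of all terms and their definitions in this scraper's vocabulary (fields of type URI annotated with Term).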
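Collecting annotated fields into a map is typically done with reflection. The sketch below shows one way this could work; it is not taken from the library. The Term annotation defined here, its value() attribute, the FilmVocabulary class and the vocabularyOf helper are all hypothetical, and java.net.URI stands in for the Sesame URI type.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

public class VocabularySketch {

    /** Hypothetical annotation; the real Term annotation may look different. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface Term {
        String value() default "";
    }

    /** Hypothetical scraper-like class declaring its vocabulary as fields. */
    public static class FilmVocabulary {
        @Term("Title of a film")
        public final URI title = URI.create("http://example.org/vocab#title");

        @Term("Release year of a film")
        public final URI year = URI.create("http://example.org/vocab#year");

        // Not annotated, so it is not part of the vocabulary.
        public final URI internal = URI.create("http://example.org/vocab#internal");
    }

    /** Maps each URI field annotated with Term to the annotation's text. */
    public static Map<URI, String> vocabularyOf(Object scraper) {
        Map<URI, String> terms = new LinkedHashMap<>();
        for (Field field : scraper.getClass().getDeclaredFields()) {
            Term term = field.getAnnotation(Term.class);
            if (term == null || field.getType() != URI.class) {
                continue; // only annotated URI fields count as terms
            }
            try {
                field.setAccessible(true);
                terms.put((URI) field.get(scraper), term.value());
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(vocabularyOf(new FilmVocabulary()));
    }
}
```

The annotation needs RUNTIME retention for the reflective lookup to see it; with the default CLASS retention, getAnnotation would return null for every field.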
protected void preRun() throws xsmeral.pipe.ProcessorStoppedException
Outputs a statement (uri, RDF.TYPE, RDFS.CLASS) for each field of type URI annotated with EntityClass.
preRun in class xsmeral.pipe.AbstractObjectProcessor
Throws:
xsmeral.pipe.ProcessorStoppedException
protected void initPostContext()
Initializes the stats.
initPostContext in class xsmeral.pipe.AbstractObjectProcessor
See Also:
Stats
protected void process() throws xsmeral.pipe.ProcessorStoppedException
Calls scrape(EntityDocument) and catches any Exception, logging it as a parsing error.
process in class xsmeral.pipe.LocalObjectFilter<EntityDocument,Statement>
Throws:
xsmeral.pipe.ProcessorStoppedException
public abstract String getNamespace()
Returns namespace used by this scraper.
protected abstract void scrape(EntityDocument doc) throws Exception
Scrapes one document and outputs any number of facts.
Parameters:
doc - The document to scrape
Throws:
Exception - Can throw any exception