ArtNet is a semantic network of works of art created using SemNet.
It contains data collected from ČSFD.cz and DatabazeKnih.cz during May 2011, comprising
- 244 000 movies,
- 58 000 actors/directors,
- 73 000 books,
- 23 000 literary authors,
and millions of relationships. Entities in ArtNet are instances of WordNet "classes" (synsets).
Schema
The schema of the collected data is defined in the scrapers, which can be downloaded from the Scrapers section. It has roughly the following structure:
Film:
- title(s)
- director(s)
- actor(s)
- genre(s)
- country(ies) of origin
- year of release
- duration
- IMDb link
- official webpage
Actor/Director:
- name
- type (actor/director)
- birth date
- IMDb link
Book:
- title
- author(s)
- year of publishing
- ISBN
Short story:
- title
- the containing book
- author(s)
- year of publishing
Writer:
- name
- pseudonym
- birth date
- date of death
- country of origin
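For illustration, the Film and Actor/Director entries above could be modelled as plain Java records. This is only a sketch under stated assumptions: the class, record and field names are hypothetical and are not taken from the scrapers or from SemNet, the duration is assumed to be in minutes, and the code needs Java 16 or later.

```java
import java.util.List;

// Hypothetical data model mirroring the schema listed above.
// None of these names come from SemNet or from the actual scrapers.
class ArtNetSchemaSketch {

    enum PersonType { ACTOR, DIRECTOR }

    // Actor/Director: name, type, birth date, IMDb link.
    record Person(String name, PersonType type, String birthDate, String imdbLink) {}

    // Film: multi-valued fields are lists; durationMinutes assumes minutes.
    record Film(List<String> titles,
                List<Person> directors,
                List<Person> actors,
                List<String> genres,
                List<String> countriesOfOrigin,
                int yearOfRelease,
                int durationMinutes,
                String imdbLink,
                String officialWebpage) {}

    public static void main(String[] args) {
        // Made-up sample values, not data from ArtNet.
        Person director = new Person("Example Director", PersonType.DIRECTOR, null, null);
        Film film = new Film(List.of("Example Film"), List.of(director), List.of(),
                List.of("Drama"), List.of("Czech Republic"), 2010, 95, null, null);
        System.out.println(film.titles().get(0) + " (" + film.yearOfRelease() + ")");
    }
}
```

Book, Short story and Writer could be modelled the same way. In ArtNet itself the data is stored as RDF statements rather than as Java objects, so this model only serves to make the field list concrete.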
Configuration
This section contains the configuration files that were used to collect the data. These files are used by SemNet's JobRunner and processors.
- job.xml
- Required by JobRunner; defines the processor chain.
- crawler.xml
- Crawler configuration; defines hosts for crawling and their parameters.
- bootstrap.list
- The list of URLs used to bootstrap the crawling process.
- wn_map.xml
- Defines mappings from the terms of the scrapers' temporary vocabulary to the vocabulary of the built network (WordNet terms in this case).
- sesame.properties
- Configuration options for the connection to a Sesame store.
- wordnet-hyponym.rdf
- The WordNet 2.0 hyponymy set, which was used to bootstrap the triple store and serves as a class hierarchy.
- wn_as_class_hierarchy.rdf
- Contains two RDF statements which enable the use of WordNet as a class hierarchy.
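The role of wn_map.xml, translating terms of the scrapers' temporary vocabulary into terms of the target vocabulary, can be pictured as a simple lookup. The sketch below is not SemNet's implementation and does not read the real file format. The class name, the method name and all URIs are hypothetical placeholders, and passing unmapped terms through unchanged is an assumption.

```java
import java.util.Map;

// Hypothetical illustration of a vocabulary mapping step.
// The real mappings are defined in wn_map.xml; these URIs are placeholders.
class VocabularyMappingSketch {

    static final Map<String, String> TERM_MAP = Map.of(
            "http://example.org/tmp#Film", "http://example.org/wordnet#synset-movie-noun-1",
            "http://example.org/tmp#Book", "http://example.org/wordnet#synset-book-noun-1");

    // Returns the mapped term, or the original term when no mapping exists
    // (the pass-through behaviour is an assumption of this sketch).
    static String mapTerm(String term) {
        return TERM_MAP.getOrDefault(term, term);
    }

    public static void main(String[] args) {
        System.out.println(mapTerm("http://example.org/tmp#Film"));
        System.out.println(mapTerm("http://example.org/tmp#Unknown"));
    }
}
```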
Scrapers
The following two files are the scrapers, which are essentially wrappers for specific websites (CSFD.cz and DatabazeKnih.cz in this case). They depend on the SemNet, Sesame and HTMLCleaner libraries, so they cannot be compiled on their own and are included here only for illustration.
- CSFDScraper.java
- The scraper class for CSFD.cz. Contains scrapers for Actor/Director and Film.
- DBKnih.java
- The scraper class for DatabazeKnih.cz. Contains scrapers for Book, Short story and Writer.
The compiled files can be downloaded as part of the complete sample package, which includes the SemNet binaries, all dependencies and the ArtNet configuration.
semnet_artnet_conf.zip (12 MB)
SemNet binaries, required libraries, sources, documentation, ArtNet configuration files