Wednesday, 30 October 2013

The best open dataset about cinema: Freebase. AKA "how I fought with Freebase to get it into my database"

For my work on Linked Data in the Digital Library domain (and out of a personal passion for cinema too), I investigated how to get the most complete open dataset about movies into my local database.

After some quality evaluation I identified Freebase as my target.
These are the steps I used:

0) Tried the Freebase export provided by Google. It did not work: their RDF has some syntactic problems.

1) Discovered :BaseKB. Downloaded the most recent version (some months ago): a copy of :BaseKB Lime, derived from the 2012-02-10 Freebase RDF dump and obtained using this tool.
Thanks, Paul Houle!
This tool cleans the original dump and makes it loadable into a database.

2) Realized that it is very big.
On my test machine (4 CPUs and 16 GB of RAM), even with one of the best RDF triple stores, Virtuoso 7, I could not complete the data load.

3) Built a Virtuoso script (plus some manual iteration):
  • Load the dataset in parts (4)
  • Create the list of object types related to the cinema domain, manually, from the Freebase website
  • Get the IDs of all the resources related to the cinema domain.
  • Load the dataset in parts 
  • Export all the triples related to the IDs selected before.
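
To make the "get the IDs, then export their triples" idea concrete, here is a minimal sketch in Python. It is not the Virtuoso script I used: it is a plain two-pass filter over N-Triples lines, and the type and property IRIs in the sample are hypothetical placeholders, not real Freebase or :BaseKB identifiers. It also assumes well-formed N-Triples with one triple per line.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def split_triple(line):
    """Split one N-Triples line into (subject, predicate, object), or return None."""
    line = line.strip()
    if not line or line.startswith("#") or not line.endswith("."):
        return None
    # Drop the terminating "." and split on the first two whitespace runs only,
    # so that literals containing spaces stay in one piece as the object.
    parts = line[:-1].rstrip().split(None, 2)
    return tuple(parts) if len(parts) == 3 else None


def collect_ids(lines, wanted_types):
    """Pass 1: IDs of all resources whose rdf:type is one of the wanted types."""
    ids = set()
    for line in lines:
        triple = split_triple(line)
        if triple and triple[1] == RDF_TYPE and triple[2] in wanted_types:
            ids.add(triple[0])
    return ids


def export_triples(lines, ids):
    """Pass 2: yield every triple whose subject is one of the selected IDs."""
    for line in lines:
        triple = split_triple(line)
        if triple and triple[0] in ids:
            yield line.rstrip("\n")


# Hypothetical sample data; in practice each pass would read the dump files.
SAMPLE = [
    '<http://example.org/m1> ' + RDF_TYPE + ' <http://example.org/type/film> .\n',
    '<http://example.org/m1> <http://example.org/name> "A film. With dots" .\n',
    '<http://example.org/p1> ' + RDF_TYPE + ' <http://example.org/type/person> .\n',
    '<http://example.org/p1> <http://example.org/name> "Someone" .\n',
]

cinema_types = {"<http://example.org/type/film>"}  # the manually built type list
cinema_ids = collect_ids(SAMPLE, cinema_types)
for triple_line in export_triples(SAMPLE, cinema_ids):
    print(triple_line)
```

The sketch only matches on the subject, so triples that point to a selected resource from outside the type list are dropped; whether that is acceptable depends on how complete the type list is.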

Lessons learned:
  • The list of types is not complete (awards are not there, for example)
  • Using a database for this kind of bulk processing is not ideal, but it lets me use a tool I am familiar with. Alternatives will be the topic of another post.
  • A lot of the data is not useful to me. The latest version of :BaseKB is split into parts (see the news), so I could choose which parts to download. It is not free to download (someone has to pay for the big transfer!), but it will be the next step
Where I use it:
  • Used in this hackathon I organized. Very nice.
  • I will enrich the information of a video library
  • Just started to play with queries
  • Dreaming: some movie recommendations.