My 2 Tech Cents: The best open dataset about cinema: Freebase, a.k.a. "how I fought with Freebase to push it into my database"

For my work on Linked Data in the Digital Library domain (and out of a personal passion for cinema too), I investigated how to get the most complete open dataset about movies into my local database.

After some quality evaluation I identified Freebase as my target.
These are the steps I used:

0) Tried the Freebase export published by Google. It does not work: their RDF has some syntax problems.

1) Discovered :BaseKB. I downloaded the most recent version available at the time (some months ago): a copy of :BaseKB Lime, derived from the 2012-02-10 Freebase RDF dump and obtained using this tool:
https://github.com/paulhoule/infovore/wiki/:BaseKB%20Lime
Thanks, Paul Houle!
This tool cleans the original dump and makes it loadable into a database.

2) Realized that it's very big.
On my test machine (4 CPUs and 16 GB of RAM), even with one of the best RDF triple stores, Virtuoso 7, I could not finish loading the data.

3) Built a Virtuoso script (plus some manual iterations):

Load the dataset in parts (4)
Create the list of object types related to the cinema domain, manually from the Freebase website
Get the IDs of all the resources related to the cinema domain.
Load the dataset in parts
Export all the triples related to the IDs selected before.
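
For comparison, here is a minimal sketch of the same select-then-export idea done outside the database, in plain Python. It is not the Virtuoso script I used. It assumes the dump is in a simple N-Triples-like layout (one `subject predicate object .` per line), and the type IRIs in `CINEMA_TYPES` are hypothetical placeholders, not real Freebase or :BaseKB identifiers. It also keeps only triples whose subject is a selected ID, which is a narrower reading of "related to".

```python
from typing import Iterable, Iterator, Optional, Set, Tuple

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Hypothetical type IRIs, standing in for the manually built list of cinema types.
CINEMA_TYPES = {
    "<http://example.org/ns/film.film>",
    "<http://example.org/ns/film.actor>",
}


def parse_line(line: str) -> Optional[Tuple[str, str, str]]:
    """Split one N-Triples-style line into (subject, predicate, object).

    Returns None for blank lines, comments, and lines that do not fit the
    assumed 'subject predicate object .' layout.
    """
    line = line.strip()
    if not line or line.startswith("#") or not line.endswith("."):
        return None
    body = line[:-1].rstrip()
    # Split only twice so that literals containing spaces stay in one piece.
    parts = body.split(None, 2)
    if len(parts) != 3:
        return None
    return parts[0], parts[1], parts[2]


def collect_ids(lines: Iterable[str], wanted_types: Set[str]) -> Set[str]:
    """First pass: IDs of all resources typed with one of the wanted types."""
    ids = set()
    for line in lines:
        triple = parse_line(line)
        if triple and triple[1] == RDF_TYPE and triple[2] in wanted_types:
            ids.add(triple[0])
    return ids


def export_triples(lines: Iterable[str], ids: Set[str]) -> Iterator[str]:
    """Second pass: yield every line whose subject is one of the selected IDs."""
    for line in lines:
        triple = parse_line(line)
        if triple and triple[0] in ids:
            yield line.rstrip("\n")
```

Both passes stream the input line by line, so each part of the dump could be read in turn (for example with `gzip.open(path, "rt")`) and only the set of selected IDs has to fit in memory.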

Lessons learned:

The list of types is not complete (awards are missing, for example).
Using a database for this kind of bulk processing is not ideal, but it lets me use a tool I am familiar with. Alternatives will be the topic of another post.
A lot of the data is not useful for me. The latest version of :BaseKB is split into parts (see the news), so I could choose which parts to download. It's not free to download (someone has to pay for the big transfer!), but it will be the next step.

Where I use it:

I used it in this hackathon I organized. Very nice.
I will use it to enrich the information of a video library.
I have just started to play with queries.
Dreaming: some movie recommendations.


Wednesday, 30 October 2013

