For my work on Linked Data in the Digital Library domain (and out of a personal passion for cinema) I investigated how to get the most complete open dataset about movies into my local database.
After some quality evaluation I identified Freebase as my target.
These are the steps I used:
0) Tried out the Freebase export by Google. It does not work: the RDF has some syntactic problems.
1) Discovered :BaseKB and downloaded the most recent version available at the time (a few months ago): a copy of :BaseKB Lime derived from the 2012-02-10 Freebase RDF dump, obtained using this tool:
https://github.com/paulhoule/infovore/wiki/:BaseKB%20Lime
Thanks Paul Houle!
This tool cleans the original dump and makes it loadable into a database.
2) Realized that it's very big.
With my test machine (4 CPUs and 16 GB of RAM) and one of the best RDF triple stores, Virtuoso 7, I could not finish loading the data.
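For reference, this is a minimal sketch of Virtuoso's standard bulk-loading procedure, run from isql; the directory path, file mask and graph IRI are placeholders to adapt to wherever the :BaseKB files actually sit:

    -- register every N-Triples file in a directory for bulk loading
    -- (path, mask and graph IRI are examples)
    ld_dir ('/data/basekb', '*.nt.gz', 'http://basekb.com');
    -- check what has been queued
    SELECT * FROM DB.DBA.load_list;
    -- run the loader; several isql sessions can run this in parallel
    rdf_loader_run ();
    -- make the load durable
    checkpoint;

Even with this procedure the full dump was too much for my machine, which is why I moved to the piecewise approach below.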
3) Built a Virtuoso script (plus some manual iteration) to do the following (a sketch of the script follows the list):
- Load the dataset in parts (4 of them)
- Create the list of object types related to the Cinema domain, manually, from the Freebase website
- Get the IDs of all the resources related to the Cinema domain
- Load the dataset in parts again
- Export all the triples related to the IDs selected before
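A condensed sketch of the selection and export steps, again from isql. The graph IRI, the ns: prefix and the handful of film types are illustrative: :BaseKB may use its own namespace instead of http://rdf.freebase.com/ns/, and my real type list was much longer.

    -- 1) collect the IDs of resources whose type belongs to the Cinema domain
    SPARQL
    PREFIX ns: <http://rdf.freebase.com/ns/>
    SELECT DISTINCT ?s
    FROM <http://basekb.com>
    WHERE {
      ?s ns:type.object.type ?t .
      FILTER (?t IN (ns:film.film, ns:film.actor, ns:film.director, ns:film.performance))
    };

    -- 2) export every triple whose subject is one of the selected resources
    SPARQL
    PREFIX ns: <http://rdf.freebase.com/ns/>
    CONSTRUCT { ?s ?p ?o }
    FROM <http://basekb.com>
    WHERE {
      ?s ns:type.object.type ?t ;
         ?p ?o .
      FILTER (?t IN (ns:film.film, ns:film.actor, ns:film.director, ns:film.performance))
    };

In practice I ran the selection and export against the part that was loaded at the moment and accumulated the results, since the whole dataset never fit in the database at once.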
Lessons learned:
- The list of types is not complete (awards, for example, are missing)
- Using a database for this kind of bulk processing is not ideal, but it let me use a tool I am familiar with. Alternatives will be the topic of another post.
- A lot of the data is not useful for me. The latest version of :BaseKB is split into parts (see the news), so I could choose which parts to download. It's not free to download (someone has to pay for the big transfer!), but that will be the next step.
- I used it in this hackathon I organized. Very nice.
- I will use it to enrich the information in a video library
- I have just started to play with queries (see the example below)
- Dreaming: some movie recommendations.
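To give an idea of the kind of query I am starting to play with, here is a first attempt at listing films with their directors. Type and property names follow the Freebase schema (ns:film.film, ns:film.film.directed_by, ns:type.object.name); the prefix and graph IRI are the same placeholders as above and may need adjusting for your copy of :BaseKB.

    SPARQL
    PREFIX ns: <http://rdf.freebase.com/ns/>
    SELECT ?filmName ?directorName
    FROM <http://basekb.com>
    WHERE {
      ?film ns:type.object.type ns:film.film ;
            ns:type.object.name ?filmName ;
            ns:film.film.directed_by ?director .
      ?director ns:type.object.name ?directorName .
      FILTER (lang(?filmName) = "en" && lang(?directorName) = "en")
    }
    LIMIT 20;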