In this first episode of "My Cinema Knowledge" I will try to describe my film catalog by mixing private information (my disk folders) with public information (Freebase).
I will use Ubuntu, OpenRefine, a small Python script and an RDF store, Virtuoso.
Step By Step How-to
- Build a CSV with all the folder names on my disk using a Linux command ( find . -type d > myMovies.csv )
- Import it into OpenRefine (I used LOD Refine, a package that includes OpenRefine and the RDF extension)
- Extract the movie name from the folder name, taking only the last part of the location
- Add a reconciliation service based on the Freebase dump created previously (the making of is described in this post) and imported into a Virtuoso triple store
For this I used the SPARQL-based reconciliation service feature of the RDF extension.
Using a custom reconciliation service over Freebase, I am not limited to the English-language matching provided by the Freebase reconciliation service.
- After 10 minutes on a machine with 8 GB of RAM, these are the results (out of about 310):
- 138 movies automatically recognized
- 66 movies with multiple choices (semi automatic)
- 109 without a match
- Reasons for the missing matches:
- Missing in Freebase (mostly Italian movies)
- Missing Italian title in Freebase
- Missing in my Freebase copy
- Some are intermediate folders (about 15)
- I also hit a severe bug when selecting new matches: https://github.com/fadmaa/grefine-rdf-extension/issues/82 (grrrr)
- UPDATE!!!
There is also a cloud-based reconciliation service for Freebase, which now also works with Italian. It should be included in OpenRefine, but it does not work because of this bug (https://github.com/OpenRefine/OpenRefine/issues/805). You can make it work by creating a new standard service using this address: http://reconcile.freebaseapps.com/reconcile
- Copy the reconciled data into a new column
- Export the CSV. One row, for example, is:
./doppiati/1984 , 1984, http://rdf.freebase.com/ns/m.03kp2l
- Transform the CSV to RDF using a Python script as simple as this, using python-rdflib. On Ubuntu, install it with:
sudo apt-get install python-pip
sudo pip install rdflib
Use it in this way:
python myMoviesToRDF.py myMovies-csv.csv myMovies.ttl
- Upload the data into my RDF store
- Enjoy data analysis NOW!
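The name-extraction and CSV-to-RDF steps above can be sketched roughly as follows. This is not the original myMoviesToRDF.py, which is not reproduced in this post. It is a minimal, dependency-free illustration that writes N-Triples (a subset of Turtle) with plain string formatting instead of rdflib. The http://example.org/myMovies/ namespace, the function names, and the choice of rdfs:label and rdfs:seeAlso as predicates are all assumptions made for the example.

```python
import csv
import io
import posixpath
from urllib.parse import quote

# Hypothetical namespace for the local catalog (not from the original post).
BASE = "http://example.org/myMovies/"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def movie_name_from_path(path):
    """Keep only the last part of the location: './doppiati/1984' -> '1984'."""
    return posixpath.basename(path.strip().rstrip("/"))


def escape_literal(text):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
            .replace("\n", "\\n").replace("\r", "\\r"))


def rows_to_ntriples(csv_text):
    """Turn 'folder, name, Freebase URI' rows into a list of N-Triples lines."""
    triples = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) < 3:
            continue  # malformed or empty row
        folder, name, uri = (cell.strip() for cell in row[:3])
        if not uri.startswith("http://"):
            continue  # header row or movie without a reconciled match
        name = name or movie_name_from_path(folder)
        # Assumed URI scheme: base namespace plus the normalized folder path.
        subject = BASE + quote(posixpath.normpath(folder), safe="/")
        triples.append('<%s> <%slabel> "%s" .' % (subject, RDFS, escape_literal(name)))
        triples.append("<%s> <%sseeAlso> <%s> ." % (subject, RDFS, uri))
    return triples


if __name__ == "__main__":
    sample = "./doppiati/1984 , 1984, http://rdf.freebase.com/ns/m.03kp2l\n"
    print("\n".join(rows_to_ntriples(sample)))
```

Rows without a Freebase URI are skipped here, so only reconciled movies end up in the store; keeping the unmatched ones with just a label would be an equally reasonable choice.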
In my first attempt I used SPARQL queries to get the genre ranking, the director ranking and the director nationality ranking. This is only a first attempt; more will come soon!
An interactive version here
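The ranking queries mentioned above are not shown in this post, so here is a hedged sketch of what they might look like, written as a small Python helper that builds the SPARQL text. It assumes that each local movie points at its Freebase topic through rdfs:seeAlso (a hypothetical linking predicate) and that the local Freebase copy contains the film.film.genre, film.film.directed_by and people.person.nationality properties. The helper name ranking_query is made up for the example.

```python
FREEBASE_NS = "http://rdf.freebase.com/ns/"
# Assumption: local movies point at their Freebase topic through this predicate.
LINK = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def ranking_query(properties, link=LINK):
    """Build a SPARQL query counting movies per value reached via `properties`."""
    # First hop: from the local movie to its Freebase topic.
    patterns = ["?movie <%s> ?n0 ." % link]
    # Following hops: one triple pattern per Freebase property, chained by ?n variables.
    for i, prop in enumerate(properties):
        patterns.append("?n%d <%s%s> ?n%d ." % (i, FREEBASE_NS, prop, i + 1))
    target = "?n%d" % len(properties)
    return "\n".join([
        "SELECT %s (COUNT(DISTINCT ?movie) AS ?movies)" % target,
        "WHERE {",
    ] + ["  " + p for p in patterns] + [
        "}",
        "GROUP BY %s" % target,
        "ORDER BY DESC(?movies)",
    ])


GENRES = ranking_query(["film.film.genre"])
DIRECTORS = ranking_query(["film.film.directed_by"])
NATIONALITIES = ranking_query(["film.film.directed_by", "people.person.nationality"])

if __name__ == "__main__":
    print(NATIONALITIES)
```

The generated text can be pasted into the SPARQL endpoint of the RDF store. As written, the queries return Freebase identifiers rather than readable names; joining on a name property would be a natural next step.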