My 2 Tech Cents: My Cinema Knowledge: "my movies" aka Multi-language reconciliation using Freebase

In this first episode of "My Cinema Knowledge" I will try to describe my film catalog by mixing private information (my disk folders) with public information (Freebase).
I will use Ubuntu, OpenRefine, a little Python script and an RDF store, Virtuoso.

Step By Step How-to

Built a CSV with the names of all the folders on my disk using a Linux command ( find . -type d > myMovies.csv )
Imported it into OpenRefine (I used LOD Refine, a package that bundles OpenRefine and the RDF extension)
Extracted the movie name from the folder name, taking only the last part of the path
Added a reconciliation service based on the Freebase dump created previously (how it was built is described in this post) and imported into a Virtuoso triple store
For this I used the SPARQL-based reconciliation service feature of the RDF extension
Using a custom reconciliation service over Freebase, I am not limited to the English-language matching provided by the Freebase reconciliation service
After 10 minutes on a machine with 8 GB of RAM, these are the results (out of about 310 folders):

138 movies automatically recognized
66 movies with multiple choices (semi automatic)
109 without a match
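
To give an idea of what a language-aware lookup against the local Freebase copy might look like, here is a minimal sketch of a SPARQL query builder. It is not the query the RDF extension actually generates. The function names are made up for illustration, and the sketch assumes the dump exposes film titles as language-tagged `rdfs:label` values and film types through `ns:type.object.type`.

```python
def escape_sparql_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_label_query(title, lang="it", limit=5):
    """Build a query matching films by exact, case-insensitive label in one language."""
    safe_title = escape_sparql_literal(title.strip().lower())
    return f"""
PREFIX ns: <http://rdf.freebase.com/ns/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?film ?label WHERE {{
  ?film ns:type.object.type ns:film.film ;
        rdfs:label ?label .
  FILTER (lang(?label) = "{lang}" && lcase(str(?label)) = "{safe_title}")
}}
LIMIT {limit}
""".strip()
```

Changing `lang` from `"it"` to another language tag is all it takes to match titles in that language, which is the point of running a custom service.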

The reasons for the missing matches are:

Missing in Freebase (mostly Italian movies)
Missing Italian title in Freebase
Missing in my Freebase copy
Some intermediate folders (about 15) that are not movies
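
The folder listing and the name extraction can also be done in one pass outside OpenRefine. The following is a minimal Python sketch, not the procedure I used above: it assumes every movie sits in a leaf folder (one with no subfolders), which would leave out the intermediate folders mentioned in the last point. The function names `list_movie_folders` and `write_csv` are made up for illustration.

```python
import csv
import os


def list_movie_folders(root):
    """Return (folder path, movie name) pairs for every leaf folder under root."""
    rows = []
    for dirpath, dirnames, _filenames in os.walk(root):
        if dirnames or dirpath == root:
            continue  # intermediate folder (or the root itself): not a movie
        # The movie name is the last part of the folder path.
        rows.append((dirpath, os.path.basename(dirpath)))
    return sorted(rows)


def write_csv(rows, out_path):
    """Write the pairs to a CSV file, one row per movie folder."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
```

For example, `write_csv(list_movie_folders("."), "myMovies.csv")` would produce a file with the folder path and the extracted movie name in two columns.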

I also ran into a severe BUG when selecting new matches: https://github.com/fadmaa/grefine-rdf-extension/issues/82 (grrrr)
UPDATE!!!
There is also a cloud-based reconciliation service for Freebase, which now works with Italian too. It should be included in OpenRefine, but it does not work because of this bug: https://github.com/OpenRefine/OpenRefine/issues/805. You can make it work by adding a new standard reconciliation service with this address:
```
http://reconcile.freebaseapps.com/reconcile 
```
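
For the curious, this is roughly what a batch request to a standard reconciliation service looks like. The sketch below only builds the form-encoded request body and does not send anything. It assumes the usual `queries` parameter of the reconciliation API that OpenRefine speaks and the Freebase type `/film/film`; the function name `build_reconcile_request` is hypothetical.

```python
import json
from urllib.parse import urlencode

RECONCILE_URL = "http://reconcile.freebaseapps.com/reconcile"  # address given above


def build_reconcile_request(titles, type_id="/film/film", limit=3):
    """Build the form-encoded body of a batch reconciliation request."""
    # One entry per title, keyed q0, q1, ... as in the batch query format.
    queries = {
        f"q{i}": {"query": title, "type": type_id, "limit": limit}
        for i, title in enumerate(titles)
    }
    return urlencode({"queries": json.dumps(queries)})
```

The resulting string would be sent to `RECONCILE_URL`; the service is expected to answer with one list of candidates per key.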
Copied the reconciled data into a new column
Exported the CSV. One row, for example, is:
./doppiati/1984 , 1984, http://rdf.freebase.com/ns/m.03kp2l
Transformed the CSV to RDF with a simple Python script based on python-rdflib, which can be installed like this (Ubuntu):
```
sudo apt-get install python-pip
sudo pip install rdflib
```
The script is used in this way:
python myMoviesToRDF.py myMovies-csv.csv myMovies.ttl
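
As an illustration of the conversion step, here is a dependency-free sketch that writes N-Triples by hand instead of going through rdflib, so it is not the script itself. N-Triples is a subset of Turtle, so the output can still be saved as a `.ttl` file. The namespace `http://example.org/mymovies/` and the choice of `rdfs:label` and `owl:sameAs` as properties are assumptions made for the example, and the CSV is assumed to have no header row.

```python
import csv
import sys
from urllib.parse import quote

BASE = "http://example.org/mymovies/"  # hypothetical namespace for local folders
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def literal(text):
    """Serialize a plain string literal with N-Triples escaping."""
    escaped = (text.replace("\\", "\\\\").replace('"', '\\"')
               .replace("\n", "\\n").replace("\r", "\\r"))
    return '"' + escaped + '"'


def row_to_triples(folder, title, freebase_uri):
    """Turn one exported row into N-Triples lines; unmatched rows get no link."""
    path = folder.strip()
    if path.startswith("./"):
        path = path[2:]
    subject = "<" + BASE + quote(path) + ">"
    lines = [f"{subject} <{RDFS_LABEL}> {literal(title.strip())} ."]
    if freebase_uri.strip():
        lines.append(f"{subject} <{OWL_SAME_AS}> <{freebase_uri.strip()}> .")
    return lines


def convert(csv_path, out_path):
    """Read the exported CSV and write one triple per line."""
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for row in csv.reader(src):
            if len(row) >= 3:
                dst.write("\n".join(row_to_triples(*row[:3])) + "\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    convert(sys.argv[1], sys.argv[2])
```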
Uploaded the data into my RDF store
Enjoy data analysis NOW!
In this first attempt I used SPARQL queries to get the genre ranking, the director ranking and the director nationality ranking. This is only a first attempt, more will come soon!
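
A ranking query of this kind could look like the sketch below. It is not one of the queries I actually ran: it assumes the local movies are linked to Freebase films through `owl:sameAs`, that the Freebase properties `film.film.genre`, `film.film.directed_by` and `people.person.nationality` are present in the store, and that the store supports SPARQL 1.1 property paths. The function name `build_ranking_query` is hypothetical.

```python
FREEBASE_NS = "http://rdf.freebase.com/ns/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"  # assumed link to Freebase


def build_ranking_query(property_path, link=OWL_SAME_AS, limit=10):
    """Build a SPARQL query counting local movies per value reached via property_path."""
    return f"""
PREFIX ns: <{FREEBASE_NS}>
SELECT ?value (COUNT(DISTINCT ?movie) AS ?movies) WHERE {{
  ?movie <{link}> ?film .
  ?film {property_path} ?value .
}}
GROUP BY ?value
ORDER BY DESC(?movies)
LIMIT {limit}
""".strip()


# The three rankings mentioned above, expressed as property paths.
GENRES = build_ranking_query("ns:film.film.genre")
DIRECTORS = build_ranking_query("ns:film.film.directed_by")
NATIONALITIES = build_ranking_query(
    "ns:film.film.directed_by/ns:people.person.nationality"
)
```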

An interactive version here

Wednesday, 20 November 2013
