Wednesday, 20 November 2013

My Cinema Knowledge: "my movies" aka Multi-language reconciliation using Freebase

In this first episode of "My Cinema Knowledge" I will try to describe my film catalog by mixing private information (my disk folders) with public information (Freebase).
I will use Ubuntu, Open Refine, a small Python script and an RDF store, Virtuoso.

Step By Step How-to

  • Built a CSV with all the folder names on my disk using a Linux command ( find . -type d > myMovies.csv )
  • Imported it into Open Refine (I used LOD Refine, a package including Open Refine and the RDF extension)
  • Extracted the movie name from the folder name, taking only the last part of the location
  • Added a reconciliation service based on the Freebase dump created previously (the making of is described in This post) and imported into a Virtuoso triple store
    For this I used the SPARQL-based reconciliation service feature of the RDF extension
    Using a custom reconciliation service over Freebase, I am not limited to the English-language names provided by the Freebase reconciliation service
  • After 10 minutes on an 8 GB RAM machine, these are the results (out of about 310):
    • 138 movies automatically recognized 
    • 66 movies with multiple choices (semi automatic)
    • 109 without a match
  • Reasons for the missing matches are:
    • Missing in Freebase (mostly Italian movies)
    • Missing Italian title in Freebase
    • Missing in my Freebase copy
    • Some intermediate folders (about 15)
  • I also hit a severe BUG when selecting new matches: (grrrr)
  • UPDATE!!!
    There is also a cloud-based reconciliation service for Freebase, now also working with the Italian language. It should be included in Open Refine, but it does not work because of this bug. You can make it work by creating a new standard one using this address: 
  • Copied the reconciled data into a new column
  • Exported the CSV. One row, for example, is:
    ./doppiati/1984 , 1984,
  • Transformed the CSV to RDF using a Python script as simple as this, using python-rdflib:
    sudo apt-get install python-pip (ubuntu)
    sudo pip install rdflib
    Use it this way:
    python myMovies-csv.csv myMovies.ttl
  • Uploaded the data into my RDF store
  • Enjoy data analysis NOW!
    In this first attempt I used SPARQL queries to get the genre ranking, the director ranking and the director nationality ranking. This is only a first attempt; more will come soon!

    An interactive version here
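
The "last part of the location" step can also be tried outside Open Refine. What follows is only a minimal sketch in plain Python, not the expression used in the project; the function names are made up for illustration. It keeps the last path component of each line produced by find . -type d and skips the root entry ".".

```python
from pathlib import PurePosixPath


def movie_name_from_folder(line):
    """Return the last path component of one line of `find . -type d` output."""
    # strip() removes the trailing newline and stray spaces around the path.
    return PurePosixPath(line.strip()).name


def movie_names(lines):
    """Collect candidate movie names, skipping blank lines and the root '.'."""
    names = []
    for line in lines:
        name = movie_name_from_folder(line)
        if name:  # '.' and empty lines yield an empty name
            names.append(name)
    return names


if __name__ == "__main__":
    sample = [".", "./doppiati", "./doppiati/1984"]
    for name in movie_names(sample):
        print(name)
```

Note that intermediate folders such as "doppiati" also come out as candidate names, which is consistent with the unmatched intermediate folders listed above.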
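
To give an idea of the CSV to RDF step, here is a dependency-free sketch that writes N-Triples lines (which are also valid Turtle) by hand instead of using rdflib, so it is not the script mentioned above. Everything about the layout is an assumption: three columns (folder, title, reconciled Freebase URI), the example.org namespace, and the choice of rdfs:label and owl:sameAs as predicates.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace and predicate choices, for illustration only.
BASE = "http://example.org/mymovies/"
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def escape_literal(text):
    """Escape the characters that are special inside an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def rows_to_triples(csv_text):
    """Turn rows of (folder, title, optional match URI) into N-Triples lines."""
    triples = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue
        folder, title = row[0].strip(), row[1].strip()
        match = row[2].strip() if len(row) > 2 else ""
        if not title:
            continue
        if folder.startswith("./"):
            folder = folder[2:]
        # Percent-encode the folder path so it is safe inside an IRI.
        subject = "<" + BASE + quote(folder, safe="/") + ">"
        triples.append('%s <%s> "%s" .' % (subject, LABEL, escape_literal(title)))
        # Assumed: the third column, when present, holds a full http(s) URI.
        if match.startswith("http") and ">" not in match and " " not in match:
            triples.append("%s <%s> <%s> ." % (subject, SAME_AS, match))
    return triples


if __name__ == "__main__":
    print("\n".join(rows_to_triples("./doppiati/1984 , 1984,\n")))
```

Rows without a reconciled match still get a label triple, so unmatched folders remain visible in the store.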
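
A genre ranking of the kind mentioned above could look roughly like the query built below. This is a hedged sketch, not the query actually used: it assumes the local movies are linked to Freebase topics with owl:sameAs, that the Freebase copy uses the http://rdf.freebase.com/ns/ namespace, that Italian genre names are wanted, and that Virtuoso answers on its default local endpoint.

```python
from urllib.parse import urlencode

FREEBASE_NS = "http://rdf.freebase.com/ns/"


def genre_ranking_query(limit=10):
    """Build a SPARQL query counting local movies per Freebase genre."""
    return f"""
PREFIX ns: <{FREEBASE_NS}>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?genreName (COUNT(DISTINCT ?movie) AS ?movies)
WHERE {{
  ?movie owl:sameAs ?film .              # assumed link from local data to Freebase
  ?film ns:film.film.genre ?genre .
  ?genre ns:type.object.name ?genreName .
  FILTER (lang(?genreName) = "it")       # assumed: Italian labels
}}
GROUP BY ?genreName
ORDER BY DESC(?movies)
LIMIT {int(limit)}
""".strip()


def sparql_url(endpoint, query):
    """Return a GET URL asking the endpoint for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    # Assumed local Virtuoso endpoint; adjust to your own installation.
    print(sparql_url("http://localhost:8890/sparql", genre_ranking_query(5)))
```

Swapping ns:film.film.genre for ns:film.film.directed_by would give a director ranking along the same lines.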
