Wednesday, 20 November 2013

My Cinema Knowledge: "my movies" aka Multi-language reconciliation using Freebase


In this first episode of "My Cinema Knowledge" I will try to describe my film catalog by mixing private information (my disk folders) with public information (Freebase).
I will use Ubuntu, OpenRefine, a small Python script and an RDF store, Virtuoso.




Step By Step How-to

  • Built a CSV with all the folder names on my disk using a Linux command ( find . -type d > myMovies.csv )
  • Imported it into OpenRefine (I used LOD Refine, a package including OpenRefine and the RDF extension)
  • Extracted the movie name from the folder name, taking only the last part of the path
  • Added a reconciliation service based on the Freebase dump created previously (the making of is described in this post) and imported into a Virtuoso triple store
    For this I used the SPARQL-based reconciliation service feature of the RDF extension
    Using a custom reconciliation service over Freebase, I am not limited to the English language provided by the Freebase reconciliation service
  • After 10 minutes on a machine with 8 GB of RAM, these are the results (out of about 310):
    • 138 movies automatically recognized 
    • 66 movies with multiple choices (semi automatic)
    • 109 without a match
  • Reasons for the missing matches are:
    • Missing in Freebase (mostly Italian movies)
    • Missing Italian title in Freebase
    • Missing in my Freebase copy
    • Some intermediate folders (about 15)
  • I also hit a severe bug when selecting new matches: https://github.com/fadmaa/grefine-rdf-extension/issues/82 (grrrr)
  • UPDATE!!!
    There is also a cloud-based reconciliation service for Freebase, which now also works with Italian. It should be included in OpenRefine, but it does not work because of this bug: https://github.com/OpenRefine/OpenRefine/issues/805. You can make it work by creating a new standard reconciliation service using this address:

    http://reconcile.freebaseapps.com/reconcile 
  • Copied the reconciled data into a new column
  • Exported the CSV. One row, for example, is:
    ./doppiati/1984 , 1984, http://rdf.freebase.com/ns/m.03kp2l
  • Transformed the CSV to RDF using a simple Python script based on rdflib. To install rdflib on Ubuntu:
    sudo apt-get install python-pip
    sudo pip install rdflib
    To run the script:
    python myMoviesToRDF.py myMovies-csv.csv myMovies.ttl
  • Uploaded the data into my RDF store
  • Enjoy data analysis NOW!
    In this first attempt I used SPARQL queries to get the genre ranking, the directors ranking and the director nationality ranking. This is only a first attempt; more will come soon!


    An interactive version here
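
As a rough illustration of the CSV-to-RDF step above, here is a minimal sketch. It is not the original myMoviesToRDF.py script, which used rdflib; to stay dependency-free it writes N-Triples lines (which are also valid Turtle) using only the Python standard library. The http://example.org/mymovies/ namespace, the "path" and "movie" predicates and the function names are assumptions made for this example; only the sample row comes from the export shown above.

```python
import csv
import io

# Hypothetical namespace for the private data (an assumption, not from the post).
MY_NS = "http://example.org/mymovies/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(text):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def rows_to_ntriples(csv_file):
    """Yield N-Triples lines for rows shaped like: folder, title, Freebase URI."""
    for index, row in enumerate(csv.reader(csv_file)):
        if len(row) < 3:
            continue  # malformed or empty row
        folder, title, freebase_uri = (cell.strip() for cell in row[:3])
        if not freebase_uri.startswith("http"):
            continue  # header row or folder without a reconciled match
        # One local resource per folder, linked to the public Freebase entity.
        subject = "<%sfolder/%d>" % (MY_NS, index)
        yield '%s <%s> "%s" .' % (subject, RDFS_LABEL, escape_literal(title))
        yield '%s <%spath> "%s" .' % (subject, MY_NS, escape_literal(folder))
        yield '%s <%smovie> <%s> .' % (subject, MY_NS, freebase_uri)


# The example row from the exported CSV.
sample = io.StringIO("./doppiati/1984 , 1984, http://rdf.freebase.com/ns/m.03kp2l\n")
triples = list(rows_to_ntriples(sample))
for line in triples:
    print(line)
```

Writing the yielded lines to a file such as myMovies.ttl gives one small set of triples per matched folder; rows without a Freebase URI (including a header row, if present) are simply skipped.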



Monday, 11 November 2013

Fact checking: Fassina vs wallstreetitalia


Who: Deputy Minister of Economy, Stefano Fassina:

 "Cutting golden pensions, even assuming that 'golden' means pensions above 3,500 euros net per month, implies savings of a few hundred million euros per year."

Source:
http://www.repubblica.it/politica/2013/11/08/news/reddito_di_cittadinanza_la_proposta_di_grillo_copertura_con_imu_su_immobili_della_chiesa_e_taglio_pensioni_d_oro-70523273/

Who: wallstreetitalia

In 2011, the 5.2% of pensioners (861 thousand people in all) who receive a monthly payment above three thousand euros absorbed 45 billion euros in total, that is, 17% of pension spending. That is slightly less than what was paid out to the 7.3 million Italians, 44% of the total, whose income does not exceed one thousand euros per month: 51 billion euros in all, equal to 19.2% of overall spending.

Source:
http://www.wallstreetitalia.com/article/1641597/le-pensioni-d-oro-costano-45-miliardi.aspx

Me:
Did I miss something?