Movie star popularity based on graph analysis
3 June 2022
The internet is an endless pool of raw data, and not all of it is worth the effort to analyse. But our colleague Jens saw that the Internet Movie Database (IMDb) was a diamond in the rough waiting to be mined. So he did what every curious data scientist would do: he rolled up his sleeves and got to work, using graph analysis to investigate the popularity of actors, actresses, and directors.
The Internet Movie Database, also known as IMDb, is an online database of information related to films and series. Subsets of the IMDb dataset are refreshed daily and available to download. We use these datasets to see who wins this Hollywood popularity contest: who was most important between 1980 and 1990, and how does this evolve over time?
In our approach, we use Neo4j as the graph database management system, which is a good choice for high-performance and scalable analysis. The nodes in our network are all actors, actresses, directors, and titles; the edges are all roles in movies with at least one vote. This results in a graph with 1.8 million persons, 1.2 million titles, and 11.6 million relations. Birth year, name, role, year, rating, and votes are added as node or edge properties.
Our setup consists of a Docker container based on the official Neo4j image and some Python scripts to parse, stream, query, and visualize everything. For this example, 4 GB of free memory was enough, but since the setup is already containerized we could deploy it in the cloud with minimal additional effort, for instance via Azure Container Instances (ACI).
Node centrality
Centrality algorithms are used to determine the importance of individual nodes in a network. Popular choices are PageRank, betweenness, and degree centrality. The latter is often used to determine the most important people in a social network, so we use weighted degree centrality (with the number of votes as the edge weight) to determine popularity. The results are visualized dynamically in a horizontal bar chart, using a moving window of 10 years and a script that adds images and the most popular movies in that period.
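To make the idea concrete, here is a minimal pure-Python sketch of weighted degree centrality over a moving window. It is not the Neo4j query Jens ran: the `ROLES` records, the names in them, and the helper functions `weighted_degree` and `top_n` are all hypothetical, and we assume that only persons are scored and that a window covers `start_year` up to (but not including) `start_year + window`.

```python
from collections import defaultdict

# Hypothetical edge records: (person, title, year, votes).
# In the real graph these would be the role relations between persons and titles.
ROLES = [
    ("Actor A", "Movie 1", 1984, 1000),
    ("Actor A", "Movie 2", 1993, 500),
    ("Director B", "Movie 1", 1984, 1000),
    ("Actor C", "Movie 3", 1990, 200),
]


def weighted_degree(roles, start_year, window=10):
    """Sum the vote weights of each person's roles inside the year window.

    Assumption: the window is [start_year, start_year + window).
    """
    scores = defaultdict(int)
    for person, _title, year, votes in roles:
        if start_year <= year < start_year + window:
            scores[person] += votes
    return dict(scores)


def top_n(scores, n=10):
    """Return the n highest-scoring persons; ties are broken alphabetically."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    # Slide the window one year at a time and print the ranking for each step.
    for start in range(1982, 1986):
        print(start, top_n(weighted_degree(ROLES, start), n=3))
```

Each step of the loop gives the ranking for one frame of the animated bar chart: a person's score is simply the sum of the votes on all titles they are connected to in that 10-year window.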
Workflow
How did Jens do this? Below you can follow his way of working on the data, step by step. This is the technical part; if you just want to see the outcome of our three scenarios, scroll all the way down!
- Parse the IMDb *.tsv.gz files (title ratings, principals, crew, basics and title names)
- Extract required information: actors, actresses, directors, writers, movies, and metadata (year, votes, ratings, ...)
- Add a constraint on person id and title id (important to increase performance)
- Stream all relevant persons, titles, and relations to the graph in batches via a driver (we used the Neo4j Python Driver)
- Call the weighted degree centrality algorithm for all windows and filters
- Use a script to download an icon for all mentioned persons
- Visualize via a bar chart in matplotlib, animate, and save as MP4
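The parsing and streaming steps above could be sketched roughly as follows. This is an illustration under assumptions, not Jens's actual scripts: the helper names (`read_imdb_tsv`, `batched`, `stream_to_graph`), the `Person` and `Title` labels, the `id` property, and the Cypher statements are ours, and the constraint syntax shown differs between Neo4j versions. The `session` argument is expected to behave like a Neo4j Python Driver session, whose `run` method accepts query parameters as keyword arguments.

```python
import csv
import gzip
from itertools import islice


def read_imdb_tsv(path):
    """Yield one dict per row of a gzipped IMDb TSV file.

    IMDb marks missing values with a backslash followed by N; we map those to None.
    Quoting is disabled because titles can contain literal quote characters.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield {key: (None if value == r"\N" else value) for key, value in row.items()}


def batched(rows, batch_size):
    """Group any iterable of rows into lists of at most batch_size items."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Assumed Cypher; labels and property names are illustrative only,
# and the constraint syntax depends on the Neo4j version in use.
CREATE_TITLE_CONSTRAINT = (
    "CREATE CONSTRAINT title_id IF NOT EXISTS "
    "FOR (t:Title) REQUIRE t.id IS UNIQUE"
)
MERGE_TITLES = """
UNWIND $rows AS row
MERGE (t:Title {id: row.tconst})
SET t.rating = toFloat(row.averageRating), t.votes = toInteger(row.numVotes)
"""


def stream_to_graph(session, query, rows, batch_size=10_000):
    """Send rows to the graph in batches and return how many rows were sent."""
    sent = 0
    for batch in batched(rows, batch_size):
        session.run(query, rows=batch)  # one round trip per batch
        sent += len(batch)
    return sent
```

Sending one `UNWIND` query per batch keeps the number of round trips low, and the uniqueness constraint gives `MERGE` an index to look up existing nodes, which is why adding it before loading matters for performance.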
Three scenarios are investigated:
- All movies between 1982 – 2022
- All movies between 1982 – 2022 with a user rating lower than 6/10
- All movies between 1982 – 2022 with a link to the Belgian scene
Scenario 1 – The best & popular
In the 1980s we see the rise of Star Wars, Harrison Ford, Steven Spielberg, Robert Zemeckis, … In the 1990s it’s Morgan Freeman, Tom Hanks, Brad Pitt, Bruce Willis, Quentin Tarantino, … and around 2000 it’s The Matrix, Fight Club, and The Lord of the Rings. Christopher Nolan and his favourite actors start to rise from 2006 onwards, and in 2012 we are introduced to the Avengers.
Scenario 2 – The not so good & popular
Since none of the ‘good’ movies are included in this graph, other actors claim their rightful spots. Top contenders: Arnold Schwarzenegger, Sylvester Stallone, Jean-Claude Van Damme, Angelina Jolie, Jessica Alba, Kristen Stewart, Will Smith, and, for obvious reasons, Adam Sandler.
Scenario 3 – The Belgian & popular
Living legend Jean-Claude Van Damme reigns until 2008 and is then succeeded by Jaco Van Dormael (Mr. Nobody), Lubna Azabal (Incendies), and Matthias Schoenaerts (The Drop, Red Sparrow, …). Honourable mentions: Gene Bervoets, Jan Decleir, the Dardenne brothers, Johan Heldenbergh, Adil El Arbi, and Bilall Fallah.
That's it for our first part on graph analysis with IMDb data. In the follow-up part we dive deeper into the shortest path algorithm and check out who knows who in Hollywood. Stay tuned!