Movie and TV Show Recommendations

In this blog post, I will use the Scrapy Python package to extract data from the IMDB website and build a simple recommendation system that recommends movies or TV shows based on the number of actors a title shares with my favorite TV show, Friends.

§1. Setup

§1.1 Locate the Starting IMDB Page

Pick your favorite movie or TV show, and locate its IMDB page. For example, my favorite TV show is Friends. Its IMDB page is at:

https://www.imdb.com/title/tt0108778/

Save this URL for a moment.

§1.2 Set Up Your Scrapy Project

  1. Create a new GitHub repository, and sync it with GitHub Desktop. This repository will house your scraper. You should commit and push each time you make significant changes to your code. Here’s a link to my project repository:
    https://github.com/EGZJ17/IMDB-Scrapy-
    
  2. Open a terminal in the location of your repository on your laptop, and type:
    conda activate PIC16B
    scrapy startproject IMDB_scraper
    cd IMDB_scraper
    

    This will create quite a lot of files, but you don’t really need to touch most of them.
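
    For reference, the startproject command generates roughly the following layout (the exact set of files can vary slightly between Scrapy versions):

    IMDB_scraper/
        scrapy.cfg            # deploy configuration
        IMDB_scraper/         # the project's Python module
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/          # the folder where imdb_spider.py will live
                __init__.py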

§2. Write Your Scraper

Create a file inside the spiders directory called imdb_spider.py. Add the following lines to the file:

import scrapy

class ImdbSpider(scrapy.Spider):
    name = 'imdb_spider'
    
    start_urls = ['https://www.imdb.com/title/tt0108778/']

Replace the entry of start_urls with the URL corresponding to your favorite movie or TV show.

§2.1 Now, implement three parsing methods for the ImdbSpider class.

  1. parse(self, response)
    def parse(self, response):
        '''
        Start on a movie or TV show's main page and navigate to its
        Cast & Crew page.
        '''
        # join the current url with "fullcredits" to get the Cast & Crew URL
        full_credits = response.urljoin("fullcredits")
        yield scrapy.Request(full_credits, callback=self.parse_full_credits)
    

    This method starts on a movie or TV show’s main page and navigates to its Cast & Crew page. Once the full credits page has been requested, the second method, parse_full_credits(self, response), is called via the callback argument of scrapy.Request.

  2. parse_full_credits(self, response)
    def parse_full_credits(self, response):
        '''
        A parsing method that navigates from the Cast & Crew page to each
        actor's page.
        '''
        # a list of relative urls, one for each actor's IMDB page
        actor_page_list = [a.attrib["href"] for a in response.css("td.primary_photo a")]

        # navigate to each actor's IMDB page
        for next_page in actor_page_list:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse_actor_page)
    

    In this method, we start on the Cast & Crew page and navigate to each actor’s page. The list comprehension above builds a list of relative paths, one for each actor. To find the link to each actor’s page, we can use the browser’s web developer tools: the link is the href attribute of the a tag inside each td.primary_photo element, and response.css extracts it. Once an actor’s page has been requested, the third method, parse_actor_page(self, response), is called, again via the callback argument of scrapy.Request.

  3. parse_actor_page(self, response)
    def parse_actor_page(self, response):
        '''
        A parsing method that yields one dictionary for each movie or TV
        show that an actor is listed in.
        '''
        # get the actor's name
        actor_name = response.css("h1.header").css("span.itemprop::text").get()

        # use a set to avoid duplicate entries
        movie_or_TV_list = set([a.get() for a in response.css("div.filmo-row b").css("a::text")])

        # yield one dictionary per movie or TV show
        for movie_or_TV_name in movie_or_TV_list:
            yield {"actor": actor_name, "movie_or_TV_name": movie_or_TV_name}
    

    To write this method, we start on the actor’s page. The method yields one dictionary of the form {"actor" : actor_name, "movie_or_TV_name" : movie_or_TV_name} for each movie or TV show that the actor has worked on. To find the actor’s name and their credits, we again use the web developer tools: the actor’s name is the text of the h1.header span.itemprop element, and each movie or TV show title is the text of a div.filmo-row b a element.

    We use the .get() method since both selectors return text. We wrap the titles in set() to avoid duplicate entries for actors who are also credited on a title in roles other than acting, such as directing or producing. Then we yield our desired dictionary. (If you want to try these selectors interactively before running the crawl, see the Scrapy shell session sketched after this list.)
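
If you’d like to sanity-check the URLs and CSS selectors before running the full crawl, Scrapy’s interactive shell is a convenient tool. The session below is just a sketch: the selectors are the ones used above, which match the IMDB layout this spider was written against, so they may need tweaking if the site changes.

scrapy shell "https://www.imdb.com/title/tt0108778/"

# resolve the Cast & Crew URL relative to the title page
>>> response.urljoin("fullcredits")
'https://www.imdb.com/title/tt0108778/fullcredits'

# fetch the Cast & Crew page and collect the actor links
>>> fetch("https://www.imdb.com/title/tt0108778/fullcredits")
>>> actor_links = [a.attrib["href"] for a in response.css("td.primary_photo a")]
>>> actor_links[:3]   # relative paths of the form '/name/nm.../'

# fetch the first actor's page and try the selectors from parse_actor_page
>>> fetch(response.urljoin(actor_links[0]))
>>> response.css("h1.header").css("span.itemprop::text").get()
>>> set(response.css("div.filmo-row b").css("a::text").getall())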

§2.2 Now our spider is ready!

Now, you can run the following command in the terminal

scrapy crawl imdb_spider -o results.csv

to create a .csv file with a column for actors and a column for movies and TV shows.
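
Each row of results.csv pairs one actor with one title they are credited on. Just as an illustration (your rows and their order will depend on the crawl), the file will look something like this:

actor,movie_or_TV_name
Jennifer Aniston,Friends
Jennifer Aniston,The Morning Show
Courteney Cox,Friends
...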

§3. Make the Recommendations

Now let’s compute a sorted list of the movies and TV shows that share the most actors with my favorite TV show, Friends! Since our data is in a .csv file, we need to do some cleaning before we can visualize our results.

First let’s import pandas and read results.csv as a pandas data frame:

import pandas as pd

# read results.csv
results = pd.read_csv("results.csv")

Now, we need to convert the movie_or_TV_name column to a set to avoid multiple entries, and then change it back to a list so we can iterate over it. Then, we count the number of times each movie or TV show name appears in the results data frame using the .sum() method, and record the counts with a list comprehension:

# convert to set to avoid multiple entries and back to list for iterable
mov_li = list(set(results["movie_or_TV_name"].tolist()))

# record the number of shared actors by list comprehension
num_actors = [(results.movie_or_TV_name == movie).sum() for movie in mov_li]

Lastly, we will make a new data frame with the movie or TV show names and the number of shared actors, and sort it in descending order:

# put the two lists into a pandas dataframe
df = pd.DataFrame(data={"movie or TV show": mov_li, "number of shared actors": num_actors})

# sort in descending order
df = df.sort_values(by="number of shared actors", ascending=False)

# reassign the index
df = df.reset_index(drop=True)

# Show the top 20 recommendations
df.head(20)
    movie or TV show                 number of shared actors
0   Friends                          823
1   ER                               182
2   Entertainment Tonight            136
3   CSI: Crime Scene Investigation   135
4   Seinfeld                         122
5   Today                            116
6   NCIS                             114
7   The Tonight Show with Jay Leno   110
8   NYPD Blue                        105
9   Jimmy Kimmel Live!               103
10  Criminal Minds                   95
11  Grey's Anatomy                   93
12  Late Night with Conan O'Brien    92
13  Diagnosis Murder                 92
14  Days of Our Lives                90
15  The West Wing                    87
16  Celebrity Page                   86
17  The Drew Carey Show              85
18  Charmed                          84
19  Without a Trace                  83

Now we have a clear list of movies and TV shows ranked by the number of actors they share with Friends!
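
As an aside, pandas can produce the same ranking more directly with value_counts(), which counts and sorts in one step. A minimal equivalent sketch (the column names that reset_index() produces differ a bit across pandas versions, hence the explicit rename):

# count how many times each title appears; value_counts() already sorts descending
rec = results["movie_or_TV_name"].value_counts().reset_index()
rec.columns = ["movie or TV show", "number of shared actors"]
rec.head(20)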

We can also use plotly to display the results.

First, import plotly express by running the following code:

from plotly import express as px

Let’s plot a histogram (the x-axis represents the name of each movie or TV show and the y-axis represents the number of shared actors). Run the following code:

fig = px.histogram(df, 
                   x = "movie or TV show",
                   y = "number of shared actors",
                   width = 600,
                   height = 300)

fig.update_layout(margin={"r":30,"t":170,"l":0,"b":0})

fig.show()

We see, of course, that Friends has the most shared actors, which makes sense!
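
Since the histogram draws one bar per title, it can get crowded. If you only want to show the strongest recommendations, one option is to plot just the top rows with px.bar, for example:

# bar chart of the 20 titles that share the most actors with Friends
fig = px.bar(df.head(20),
             x = "movie or TV show",
             y = "number of shared actors",
             width = 600,
             height = 300)

fig.show()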

Written on April 30, 2022