Web scraping and recommendation system building

Introduction

Ever spend over half an hour trying to find a good show or movie to watch on the weekend? Ever feel like none of the TikTok videos or Spotify daily recommendations are interesting enough to keep you on the app? It is probably because the platform's recommendation system is not doing a good job of suggesting content that suits your tastes. Recommendation systems suggest content that you might like based on your previous selections and overall user profile, and they are pivotal tools for many entertainment and streaming services. While scalable and precise recommendation systems require an in-depth understanding of user behaviour and typically involve machine learning models, there are ways to make recommendations with simpler tools. In this blog, we will be using web scraping to answer the following question:

What movies or TV shows share actors with your favorite movie or show?

To answer this question, we will crawl The Movie Database (TMDB) and gather shows and movies that share cast members with your favourite movie or show to come up with a list of recommendations.

Repository for this blog post: https://github.com/linnilinnil/Shows-recommendation-system

Setting up

For demonstration, I will pick an all-time classic, The Godfather Part II. Let's save the link for now, https://www.themoviedb.org/movie/240-the-godfather-part-ii, and check out what the page displays. To examine the source code of the site, we can use the browser's Inspect tool to look at what content is hosted under which section of the page.

Dry-run navigation

We will practice clicking through the navigation our scraper will follow, try a few things in the scrapy shell, and look for potentially interesting elements on the page.

The link to the movie page can be decomposed into the following parts (a quick check with Python's urlparse follows the list):

  • https: Hypertext Transfer Protocol Secure, a set of rules that governs the transfer of data on the world wide web.
  • themoviedb: the domain name.
  • org: the domain extension, indicating what type of site TMDB is (an organizational site).
  • movie/240-the-godfather-part-ii: the path, i.e. the address of the specific resource within the site's folders.
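
If you want to verify this decomposition programmatically, Python's standard urllib.parse module (not needed by the scraper itself, just a quick sanity check) splits the URL into the same pieces:

from urllib.parse import urlparse

parts = urlparse("https://www.themoviedb.org/movie/240-the-godfather-part-ii")
print(parts.scheme)   # https
print(parts.netloc)   # www.themoviedb.org
print(parts.path)     # /movie/240-the-godfather-part-ii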

Taking a closer look at each section and inspecting the source code, we see:

  • Top Billed Cast: under <section class="panel top_billed scroller"> (note there are three classes separated by spaces; one of them, top_billed, is unique to this section).
    The names of the cast can be found in the alternative text of the images (<img ... alt="..." ...>), and in the text of the second link in each card (the first link wraps the image).

  • Full Cast & Crew: by clicking this link, we are taken to a page showing the cast & crew list of the movie. The link is https://www.themoviedb.org/movie/240-the-godfather-part-ii/cast, indicating the content is put under a subfolder of the movie page, /cast. (A quick scrapy shell check of these selectors follows below.)
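
A quick way to sanity-check these selectors is the scrapy shell. The selectors below follow the structure described above, but the exact markup may change if TMDB updates its layout, so treat them as a starting point:

#In terminal
scrapy shell https://www.themoviedb.org/movie/240-the-godfather-part-ii
#In scrapy shell
response.css("section.top_billed img::attr(alt)").getall()    # top-billed cast names, from the image alt text
response.css("section.top_billed a::attr(href)").getall()     # links inside the Top Billed Cast section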

We keep traversing by going into Al Pacino’s actor page https://www.themoviedb.org/person/1158-al-pacino:

  • Known For: lists a series of Al Pacino's best-known works, including classics like The Godfather and Scarface (but not Scent of a Woman or Dog Day Afternoon).
    These appear under a div with the id "known_for", where the names of the works are enclosed within <bdi> tags.
  • Actor name: the name Al Pacino is mentioned throughout the page. One way to get the name is by parsing the title of the page: Al Pacino — The Movie Database (TMDB).
  • List of works: there is an option to filter for 'Acting' in the Department dropdown. We end up with the link https://www.themoviedb.org/person/1158-al-pacino?credit_department=Acting, which is the original link with ?credit_department=Acting appended (an additional argument passed along with the request to the site).
  • Individual works: listed in tables (one per year) with the class name "credit_group". The name of each work is enclosed within a pair of <bdi> tags. (Again, a quick shell check of these selectors follows below.)
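
The same kind of check works on the actor page. The selectors below echo the tags and classes noted above (plus the table.credits selector the spider will use later); verify them against your own inspect results:

#In terminal (quote the URL so the shell does not mangle the ? query string)
scrapy shell "https://www.themoviedb.org/person/1158-al-pacino?credit_department=Acting"
#In scrapy shell
response.css("title::text").get()                    # 'Al Pacino — The Movie Database (TMDB)'
response.css("div#known_for bdi::text").getall()     # titles in the Known For strip
response.css("table.credits bdi::text").getall()     # titles of individual works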

Initializing the project

  1. Create a new GitHub repository, and sync it with GitHub Desktop. This repository will house your scraper. You should commit and push each time you make significant changes to your code. Use the following commands in the terminal.
cd [path to your repository]
git init
git remote add origin [link to GitHub repo]
git add .
git commit -m "[your commit message]"
git push -u origin main

Open a terminal in the location of your repository on your laptop, and type:

scrapy startproject TMDB_scraper
cd TMDB_scraper

You will notice the folder now looks like:

TMDB_scraper/
    scrapy.cfg            # deploy configuration file

    TMDB_scraper/         # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

Tweak settings of the web scraper

For now, add the following line to the file settings.py:

CLOSESPIDER_PAGECOUNT = 20

This line just prevents your scraper from downloading too much data while you’re still testing things out. You’ll remove this line later.

Hint: Running into a 403 Forbidden error?
Sometimes a request gets a 403 Forbidden error in response from the site. This occurs when you are not permitted to access a web page. There is nothing wrong with the connection; the site's server just says 'No' to you ¯\_(ツ)_/¯

To address this problem, let's take a closer look at the settings.py file. If you find the USER_AGENT section, you will see that the default setting has you identify yourself as a scrapy bot, which will very likely get you rejected from the site. We can change this to a more 'real-user-like' user agent. A more convenient way of configuring your identity is by modifying the header (overriding the DEFAULT_REQUEST_HEADERS setting).
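
If you just want the simpler route, you can also override USER_AGENT directly in settings.py. The browser string below is only an example; any recent real-browser user agent should work:

# in settings.py: a minimal alternative to overriding the full request header
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0"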

What is a request header and where can I find mine? A request header is an HTTP header that can be used in an HTTP request to provide information about the request context, so that the server can tailor the response. It usually includes the User-Agent specification.

To find the header your computer sends to the site, go to the Network tab in the inspect window and refresh the page. Search for User-Agent, and you will see (at least part of) your request header. Copy and paste that into the settings file. As an example, I use:

DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Cache-Control": "max-age=0",
}

Creating your own scraper

Create a file inside the spiders directory called tmdb_spider.py. Add the following lines to the file.
Replace the entry of start_urls with the URL corresponding to your favorite movie or TV show.

# to run 
# scrapy crawl tmdb_spider -o movies.csv

import scrapy
class TMDB_Spider(scrapy.Spider):
    name = 'tmdb_spider'
    start_urls = ['https://www.themoviedb.org/movie/240-the-godfather-part-ii']

Now, we will implement three parsing methods for the TMDB_Spider class, to parse the start page, the Cast & Crew page, and the individual actor/actress pages, respectively.

parse(self, response):

This function should assume that you start on a movie page, and then navigate to the Cast & Crew page. Remember that this page's URL is just the movie page's URL followed by /cast. (You are allowed to hardcode that part.)

def parse(self, response):
    """
    main parser that starts from the favourite movie page and goes to the cast page
    """
    # since we already know the pattern, we just hardcode the path to the Cast & Crew page
    cast_url = self.start_urls[0] + '/cast'
    # then we proceed to calling the parse_full_credits method with the new url
    yield scrapy.Request(url=cast_url, callback=self.parse_full_credits)

This method does not return anything to the output csv file; we just jump to the next parser for the Cast & Crew page. The scrapy.Request function represents an HTTP request, which is usually generated in a Spider and executed by the Downloader, thus generating a Response. The url is the URL of the Cast & Crew page, while the callback is the function that will be called with the response of this request (once it's downloaded) as its first parameter. In this scenario, we use the parse_full_credits function, which we will define next, to parse the Cast & Crew page.

parse_full_credits(self, response)

This function should assume that you start on the Cast & Crew page. Its purpose is to yield a scrapy.Request for the page of each actor listed on the page. Crew members are not included. The yielded request should specify that the method parse_actor_page(self, response) be called when the actor's page is reached. The parse_full_credits() method does not return any data. This method should be no more than 5 lines of code, excluding comments and docstrings.

def parse_full_credits(self, response):
    """
    cast&crew page parser that traverses through pages of all the cast of the movie
    """
    # the cast information is put under the ol (ordered list) with class names "people" and "credits"
    # the link to each individual cast page is put in <a href=[relative path]>
    castlinks = response.css("ol")[0].css("div.info").css("a::attr(href)").getall()
    # we can hardcode the query string to filter for works the person acted in
    # use list comprehension to generate the list of cast pages
    castlinks = [i+"?credit_department=Acting" for i in castlinks]
    casts = [response.urljoin(link) for link in castlinks]

    for cast in casts:
        yield scrapy.Request(cast, callback = self.parse_actor_page)

We mainly use CSS selectors and the urljoin method to get the links to the pages of individual cast members. After getting a list of links to the cast members, we traverse them one by one using a for loop. For each link, we call the parse_actor_page function, which we will define next, to process the page of the actor/actress.
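
As a quick illustration of what urljoin does here, the standard-library urljoin resolves links the same way response.urljoin does (the relative path below is the Al Pacino link we saw on the Cast & Crew page):

from urllib.parse import urljoin

# resolve the relative actor link against the Cast & Crew page's URL
urljoin("https://www.themoviedb.org/movie/240-the-godfather-part-ii/cast",
        "/person/1158-al-pacino?credit_department=Acting")
# -> 'https://www.themoviedb.org/person/1158-al-pacino?credit_department=Acting'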

parse_actor_page(self, response):

This function should assume that you start on the page of an actor. It should yield a dictionary with two key-value pairs, of the form {"actor" : actor_name, "show_name" : show_name}. The method should yield one such dictionary for each of the movies or TV shows on which that actor has worked. Note that you will need to determine both the name of the actor and the name of each movie or TV show. This method should be no more than 15 lines of code, excluding comments and docstrings.

def parse_actor_page(self, response):
    """
    start on the page of an actor, yield information about the actor's works
    """
    # get actor name: the actor's name is in the page title, before the dash symbol
    title = response.css("title::text").get()
    actor_name = title.split(" —")[0]

    # acting information is put under the first table with classes 'card' and 'credits'
    # the name of each work is enclosed in <bdi>
    shows = response.css("table.credits")[0].css("bdi::text").getall()
    for work in shows:
        yield {"actor" : actor_name, "show_name" : work}

As discussed in the dry-run navigation section, we can find where to look for the information we want by examining HTML tags and CSS attributes through the browser's inspect tool. This function actually yields the dictionaries containing the recommended works that end up in the final .csv output.
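
For instance, plugging in the page title we saw during the dry-run navigation shows how the split isolates the actor's name:

# the page title looks like "Al Pacino — The Movie Database (TMDB)"
title = "Al Pacino — The Movie Database (TMDB)"
actor_name = title.split(" —")[0]
print(actor_name)   # Al Pacino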

Provided that these methods are correctly implemented, you can run the command (please run this in the project directory!)

scrapy crawl tmdb_spider -o results.csv

to create a .csv file with a column for actors and a column for movies or TV shows.

Once your spider is fully written, comment out the line

CLOSESPIDER_PAGECOUNT = 20

in the settings.py file to get the full results.

Experimentation in the scrapy shell:

Some of the following commands might be helpful for trying things:

#In terminal
scrapy shell https://www.themoviedb.org/movie/240-the-godfather-part-ii
#In scrapy shell
response.css('div').attrib
#to fetch a new url
fetch([url],redirect=True)

Ideas for building a slightly better rec system

  • We can make the recommendations a little more accurate by including only the top-billed cast, i.e. examining the Top Billed Cast section of the movie/show's page; or by excluding uncredited cast members who only make minor appearances, i.e. filtering out cast members whose character description includes '(uncredited)'. (A small sketch of the first idea follows after this list.)

  • When examining an individual cast member's works, we can include only TV shows in which they are a main character, and/or movies in which they are top-billed. (This would involve examining movies and TV shows separately through the 'All' drop-down menu and further visiting the page of each individual work to get its top-billed cast and episode information.)

  • We can also include the keywords section in our scraping so we can filter for work with similar keywords in the final results.
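
As a sketch of the first idea, parse() could follow only the actors in the Top Billed Cast strip instead of jumping to the Cast & Crew page. The section.top_billed class comes from the dry-run navigation above, but the assumption that its cards link to /person/... pages the same way the cast page does should be verified in the shell:

# a minimal sketch, assuming the Top Billed Cast cards link to /person/... pages
def parse(self, response):
    """
    follow only the actors shown in the "Top Billed Cast" strip of the movie page
    """
    links = response.css("section.top_billed a::attr(href)").getall()
    # keep only links to actor pages (the cards also link to images)
    links = [l for l in links if l.startswith("/person/")]
    for link in set(links):
        yield scrapy.Request(response.urljoin(link + "?credit_department=Acting"),
                             callback=self.parse_actor_page)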

Making recommendations

To decrease the amount of scraping, I implement one of the improvement ideas from above, excluding the uncredited cast. The final code for the parse_full_credits method is as follows:

def parse_full_credits(self, response):
    """
    cast&crew page parser that traverses through pages of all the cast of the movie
    """
    # the cast information is put under the ol (ordered list) with class names "people" and "credits"
    # get the character information corresponding to each cast member from the <p> tag with class name "character"
    characters = response.css("ol")[0].css("p.character::text").getall()
    # the link to each individual cast page is put in <a href=[relative path]>
    castlinks = response.css("ol")[0].css("div.info").css("a::attr(href)").getall()
    # use list comprehension to exclude uncredited cast members' pages
    creditedlinks = [castlinks[i] for i in range(len(castlinks)) if "(uncredited)" not in characters[i]]
    # we can hardcode the query string to filter for works the person acted in
    # use list comprehension to generate the list of cast pages
    creditedlinks = [i+"?credit_department=Acting" for i in creditedlinks]
    casts = [response.urljoin(link) for link in creditedlinks]
    for cast in casts:
        yield scrapy.Request(cast, callback = self.parse_actor_page)

Now, we read in the final lists of movies/shows.

import pandas as pd 
fav = "The Godfather Part II"
rec = pd.read_csv("/Users/linlin/Desktop/2023/16b/rec/TMDB_scraper/movies-updated.csv")
rec = rec[rec["show_name"] != fav]
rec.head()
    actor       show_name
0   Al Pacino   Knox Goes Away
1   Al Pacino   Sniff
2   Al Pacino   Billy Knight
3   Al Pacino   King Lear
4   Al Pacino   Brad Pitt: More Than a Pretty Face

Once you're happy with the operation of your spider, compute a sorted list of the top movies and TV shows that share actors with your favorite movie or TV show. For example, it may have two columns: one for movie names and one for the number of shared actors.

sorted = rec.groupby("show_name").count().sort_values(['actor'], ascending=False)
picks = sorted.reset_index().head(15)
picks
    show_name                                            actor
0   Mario Puzo's The Godfather: The Complete Novel...       23
1   The Godfather                                            18
2   'The Godfather' Family: A Look Inside                     8
3   Kojak                                                     8
4   The Godfather Part III                                    7
5   Mannix                                                    7
6   The Godfather Trilogy: 1901-1980                          6
7   Tales from the Darkside                                   6
8   Becoming Al Pacino                                        6
9   The Rockford Files                                        6
10  The F.B.I.                                                6
11  Hill Street Blues                                         5
12  Jake and the Fatman                                       5
13  Night of 100 Stars                                        5
14  The Oscars                                                5
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook_connected"
pl = px.histogram(picks, hover_name="show_name", x="show_name", y="actor",
                  title="Top recommendations with the most shared actors w/ 'The Godfather II'",
                  text_auto=True)
pl.update_layout(
    yaxis_title="num. of shared actors",
    xaxis_title="movie/tv name",
    margin={"r": 0, "t": 30, "l": 0, "b": 0},
    font={"size": 8}
)
pl.update_traces(
    hovertemplate = None,
    hoverinfo = "skip"
)
pl.show()