Building a Movie Recommendation Engine for Beginners

How to make a remarkably simple and accurate movie suggester using Python Pandas


Have you ever wondered how Netflix manages to recommend so many apparently random movies that all somehow fit your preferences, in one way or another? The answer is data science — more specifically, big data, machine learning, and artificial intelligence.

In this article, we won’t be exploring machine learning, AI, or Netflix’s complex and intricate recommendation system. Instead, we’ll design a movie recommendation system in easy-to-follow steps, to give you some hands-on experience of the vast world of data science.


Introduction to Data Analysis: Movie Recommendations

The main component of our movie recommendation system relies on a learning concept called collaborative filtering. Collaborative filtering bases its suggestions only on users’ past data and preferences, mostly in the form of reviews (although there are other methods of gathering user preferences).

To understand this, I’ll illustrate a user-user example of collaborative filtering’s nearest-neighbor algorithm. Say I enjoyed watching movie A, movie B, and movie C, and my friend enjoyed watching movie A, movie B, and movie D. The collaborative filtering algorithm will most likely suggest that I will enjoy movie D, and that my friend will enjoy movie C, based on our previous positive preferences. It makes sense that my friend and I will enjoy our recommendations because we share similar tastes. Of course, this isn’t always the case, but the odds become increasingly in our favor as we analyze larger datasets of user reviews.
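The idea above can be sketched in a few lines of pandas. The movie names and ratings below are made up for illustration:

```python
import pandas as pd

# Rows are users, columns are movies; NaN means "hasn't watched it yet".
ratings = pd.DataFrame(
    {"A": [5, 5], "B": [4, 5], "C": [5, None], "D": [None, 4]},
    index=["me", "friend"],
)

# My friend is my nearest neighbor (we both rated A and B highly),
# so the movie they liked that I haven't seen becomes my suggestion.
unseen = ratings.loc["me"].isna()
suggestion = ratings.loc["friend"][unseen].idxmax()
print(suggestion)  # → D
```

With only two users the "nearest neighbor" is trivial, but on a real dataset we would first rank all other users by similarity and then borrow suggestions from the closest ones.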

An example of collaborative filtering.

One method of finding the similarity between two items is the Pearson correlation coefficient, which gives us a number between -1 (least similar) and 1 (most similar) representing how alike two items are. There are many other ways to compute correlation, but in this lesson we’ll focus only on the Pearson correlation coefficient.

The formula for computing the Pearson Correlation coefficient.
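For reference, a standard form of the coefficient for two rating vectors $x$ and $y$ of length $n$ is:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Here $\bar{x}$ and $\bar{y}$ are the mean ratings; the numerator measures how the two rating patterns move together, and the denominator normalizes the result to the range $[-1, 1]$.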

Fortunately for us, Pandas has a helpful function called corrwith() that can calculate the Pearson correlation coefficient for us, provided we have the data in the correct format. We’ll use this function to perform the majority of the statistical analysis on our large datasets.
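As a quick sanity check, here is corrwith() on a tiny, made-up ratings table (the movie and user names are placeholders):

```python
import pandas as pd

# Three users' ratings of two hypothetical movies.
ratings = pd.DataFrame(
    {"Movie A": [5, 4, 1], "Movie B": [4, 5, 2]},
    index=["user1", "user2", "user3"],
)

# Pearson correlation of every column against "Movie A".
similarity = ratings.corrwith(ratings["Movie A"], method="pearson")
print(similarity)
# "Movie A" correlates perfectly with itself (1.0); "Movie B" comes out
# around 0.84 -- a very similar rating pattern.
```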

Our project will use both the Pearson correlation coefficient and collaborative filtering’s nearest neighborhood algorithm to construct a reliable and successful recommendation system.

A special thank you and shout out to Pulkit Sharma and his amazing guide on Recommendation Engines.

Installing Anaconda-Navigator and Importing Pandas

To start constructing our movie recommendation system, begin by installing the Anaconda-Navigator linked here. I’ll be downloading the 64-Bit Graphical Installer for macOS; however, you should download the correct installer for your operating system. The Anaconda-Navigator gives you access to many top-notch computing applications, such as JupyterLab, Jupyter Notebook, Spyder, Glueviz, and many more. For our project, we’ll be doing all of our problem-solving and programming in JupyterLab.

When you finish the installation, open up the Anaconda-Navigator. The main screen of the navigator should look something like this:

Anaconda-Navigator’s home screen for macOS

Next, we install Pandas — a Python software library that has useful methods for data analysis. There are multiple ways of installing Pandas — through Anaconda, Miniconda, etc. — however, I will be using the Terminal and pip to install Pandas on my computer. If you haven’t installed pip, check out this article for a guide.

Start by opening up terminal and typing in the following command:

pip install pandas

This should initiate the installation of the Pandas Python package onto your computer. The package should take a few minutes to install. When you’ve finished, congratulate yourself: you’re one step closer to building your movie recommendation system.

Getting Familiar with JupyterLab

The second step of constructing this movie recommendation system takes place entirely in JupyterLab, so begin by launching JupyterLab from the Anaconda-Navigator’s home screen.

You should be met with a page like this.

Make a new empty folder by clicking the second top-left icon (that combines a folder and a plus sign). Navigate to the new folder’s contents — it should be empty and have the name “Untitled Folder”. Click on the Python 2 button under “Notebook” to create a new Untitled.ipynb.

Click on the Python 2 button to create a new Untitled.ipynb.

Rename your Untitled Folder and your Untitled.ipynb file to keep your workspace organized. I’ll be naming my folder “movieRecommender” and my .ipynb file “MovieSuggester”.

In your new folder, you should have an .ipynb file with no code in it so far. Take some time to experiment with JupyterLab and get used to the environment. When you’re finished, we can move on to the next step.

Visualizing Data and Programming in JupyterLab

The third step is when we perform all of our statistical analysis and coding, and it’s the most important step of our project.

Before we start, I want to mention the type of learning I strive for in this project. I’ve noticed that the “MovieLens 25M Dataset” sometimes suggests slightly different movies depending on when you download and import it into JupyterLab. In addition, the datasets I gathered may occasionally be updated by MovieLens themselves; a “MovieLens 25M Dataset” today may be very different from a “MovieLens 25M Dataset” in the future. So focus on the foundational concepts being taught in this lesson and how each piece builds on the others, rather than on whether your DataFrames look exactly like the DataFrames below. Keep this in mind, and let’s start building!

Begin by downloading a MovieLens dataset from here. The dataset I’m downloading and using is the “MovieLens 25M Dataset”, which includes 25 million reviews with the most recent data from 2019. This dataset will allow my program to make the most accurate and up-to-date movie suggestions. However, the dataset you choose is totally up to you and depends on what your computer can and can’t handle. (Make sure you read and fill out the Google Form to request and gain permission to use the data from MovieLens.)

When you have your dataset downloaded onto your computer, import the movies.csv and ratings.csv to JupyterLab.

After you import movies.csv and ratings.csv, it should look like this.

Now, we can import Pandas into the .ipynb file, giving us access to all of the functions and methods found in Pandas. We do this by writing the line of code below on the first line of the .ipynb file.

import pandas as pd

After importing Pandas, we can read the data we downloaded from MovieLens. Using the read_csv() function from Pandas, we can assign variables to represent our imported datasets. I’ll use titles and data to hold mine.

titles = pd.read_csv("movies.csv")
data = pd.read_csv("ratings.csv")

Let’s look at how our datasets are organized. We can do this with the following functions:

titles.head(10)
data.head(10)

titles and data should look like the images below, respectively.

Left: titles. Right: data. We can see that ‘titles’ has three unique columns and ‘data’ has four unique columns.

Great. We can see that titles has some important information that data does not, and vice versa. For instance, data lacks the title and genres columns, whereas titles lacks the rating, userId, and timestamp columns. Consequently, it makes sense to merge titles and data into a single dataset with all the necessary information. We can do this with the following line of code, setting data to be the merged dataset.

data = pd.merge(titles, data, on = "movieId")

Below is an example of what my data looks like. (We can use the function data.head(10) again to see what the first ten rows of our data look like.)

Notice how we have movieId, title, genres, userId, rating, and timestamp. We can see that ‘data’ has 6 different columns now.

Let’s visualize the user ratings and see if we can calculate each movie’s average rating and how many people have rated each movie. Both the number of ratings and the average ratings will be crucial in helping us sort out outliers in the future and make our suggestions more accurate.

The following line of code does all of the counting and averaging for us, and puts the results in a DataFrame for future use. I’ll round my mean ratings to one decimal place to mirror real-world movie rating systems:

reviews = data.groupby('title')['rating'].agg(['count','mean']).reset_index().round(1)

My DataFrame of ratings for the first ten rows:

‘count’ is the total number of reviews for each movie, and ‘mean’ is the average rating for each movie.

Great — we now have a couple of DataFrames with valuable information. Let’s move on to the computational part of this project.

All of the parts of this step revolve around the corrwith() function in Pandas. The corrwith() function is extremely useful for us since it can calculate the correlation between the columns of two DataFrames.

To start, we must first make a DataFrame with movie titles as the columns and userIds as the rows, with the values being each viewer’s ratings.

The reason why we do this is mainly due to the nature of the corrwith() function—but more specifically, the Pearson correlation coefficient. As stated above, using the Pearson correlation coefficients, the corrwith() function can calculate the similarity between two DataFrame columns. Thus, we want the movie titles as columns so that we can take one column (which represents one movie) and compare it with the others. We also need to have user ratings as the values of our DataFrame because our final goal is to compare the similarity between how each user rated the different movies, and then use the highest similarities to suggest movie recommendations.

The code below will create such a DataFrame. This line may take a while to execute, as we are taking all 25 million movie reviews and all of the movie titles in our dataset and combining them into a whole new DataFrame (of course, this depends on your computer and the dataset you’re working with). For me, it took around five minutes at most the first time I built this DataFrame.

movies = pd.crosstab(data['userId'], data['title'], values = data['rating'], aggfunc = 'sum')
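An equivalent way to build this table, assuming each user rates a movie at most once (as in MovieLens), is pivot_table. Here it is on a tiny made-up sample standing in for the merged data:

```python
import pandas as pd

# A miniature stand-in for the merged MovieLens data.
data = pd.DataFrame({
    "userId": [1, 1, 2],
    "title": ["A", "B", "A"],
    "rating": [5.0, 3.0, 4.0],
})

# Users as rows, titles as columns, ratings as values;
# movies a user hasn't rated become NaN.
movies = data.pivot_table(index="userId", columns="title", values="rating")
print(movies)
```

Both approaches produce the same users-by-titles layout; crosstab with aggfunc='sum' and pivot_table only differ when a user somehow rates the same movie more than once.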

Now, we can input three of our favorite movies to start generating some correlations between movies. Technically, you can choose however many favorite movies you want. However, I recommend choosing at most five movies with similar genres so the algorithm can perform optimally.

I used a list of strings, which keeps my code to a minimum, as opposed to assigning each of my three favorite movies to a separate variable.

userInput = ["Inception (2010)", "Interstellar (2014)", "Arrival (2016)"]

Inception (2010), Interstellar (2014), and Arrival (2016) are my three favorite movies of all time. Notice that we need to put the movie’s release year at the end. Essentially, we’re mimicking the format of the movie names (the column names) in our movies DataFrame so that the computer can accurately find the right movie column from our string.

Let’s start calculating some correlations using the corrwith() function. It begins by finding the correlation of our first inputted movie and the first movie column in the movies DataFrame. Then it repeats this process for all of the movie titles in the movies DataFrame — in other words, we’ll repeat this process for every movie in our MovieLens dataset. You might get a runtime warning in your code, but for now, we can ignore that, and everything will still run perfectly fine.

Don’t forget to specify the method of correlation you want. As I said, we’ll be using the Pearson correlation coefficient for this project, but there are other ways of calculating correlation, such as Kendall’s tau and Spearman’s rank correlation coefficients. There is a very detailed article that describes these methods of statistical correlation if you’re interested or still confused about the Pearson correlation coefficient.

movies.corrwith(movies[userInput[0]], method = 'pearson')

Eventually, we will have a table of all of the movie names with a correlation value next to each. The table below has the similarity of every movie compared to our first inputted favorite movie (in my case, Inception (2010)). Even though the table technically contains all of the movies from our dataset, I’ll only show the correlation values of the first ten movies for the sake of simplicity.

NaN means that there is nothing for the computer to correlate and compare with.

We repeat this process twice more, once for each of our remaining inputted favorite movies. Then, we can sum the correlation values to find every movie’s total correlation to our favorite movies as a group. This part of the code may be slow to execute, as the computer needs to do massive amounts of computation.

similarity = (movies.corrwith(movies[userInput[0]], method = 'pearson')
              + movies.corrwith(movies[userInput[1]], method = 'pearson')
              + movies.corrwith(movies[userInput[2]], method = 'pearson'))
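If you pick more or fewer favorites, the three-term sum generalizes to a loop. Here is a sketch on a tiny made-up ratings table standing in for the real movies DataFrame:

```python
import pandas as pd

# Toy stand-in for the users-by-titles `movies` DataFrame.
movies = pd.DataFrame({
    "A": [5, 4, 1],
    "B": [4, 5, 2],
    "C": [1, 2, 5],
})
userInput = ["A", "B"]

# Sum the Pearson correlations against each favorite, elementwise.
similarity = sum(movies.corrwith(movies[m], method="pearson") for m in userInput)
print(similarity)
```

sum() adds the per-favorite correlation Series elementwise, so this works for any length of userInput.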

With these similarity values, we can create a DataFrame with all of the movie titles, correlation values, total numbers of ratings, average ratings, movieIds, and genres. The merge() function will be helpful for this task.

correlatedMovies = pd.DataFrame(similarity, columns = ['correlation'])
correlatedMovies = correlatedMovies.reset_index()  # turn the title index into a 'title' column for merging
correlatedMovies = pd.merge(correlatedMovies, reviews, on = 'title')
correlatedMovies = pd.merge(correlatedMovies, titles, on = 'title')
My ‘correlatedMovies’ DataFrame with all of the important columns.

Great work! You’ve finished all the hard parts of this lesson. The final step is to make use of the data that we extracted.

Cleaning the Recommendation List

As mentioned above, the final step of this project is to make sure our movie recommendation system actually recommends accurate suggestions. To do this, I chose to get rid of movies rated 3.5 stars or lower and movies with fewer than 300 reviews. This part is completely up to you; change the thresholds to whatever you want.

Finally, we can order the correlations from highest to lowest by adding ascending = False to our code, since we want the most similar movies recommended first.

output = correlatedMovies[(correlatedMovies['mean'] > 3.5) & (correlatedMovies['count'] >= 300)].sort_values('correlation', ascending = False)

Now, let’s see what the first 25 rows of output look like:

Remember to use the head() function to view the data.

Interesting. It seems that two of the movies I put into this algorithm as my top three favorite movies of all time became the movie recommendation system’s first and second suggestions. Although this makes sense for the computer, we don’t want the algorithm to recommend movies that we’ve already watched and liked. To fix this, we write and execute the following line of code:

output = output[~output['title'].isin(userInput)]

You might be wondering why some movies have a correlation value of over one. Didn’t I say that the Pearson correlation coefficient can only be a value between -1 and 1? Well, that’s true, but we have to remember that I’m adding the total correlations of three different movies. Therefore, our correlation values actually span from -3 to 3.

In addition, I rename the column titles of my output and get rid of the correlation and movieId columns to make my table look cleaner:

del output['movieId']
del output['correlation']
output.rename(columns = {
    "title": "Movie Suggestions based on " + userInput[0] + ', ' + userInput[1] + ', ' + userInput[2],
    "count": "Number of Ratings",
    "mean": "Ratings",
    "genres": "Genres",
}).head(25)

My final output is something along the lines of this:

Awesome! I can tell that this movie recommendation system did an excellent job because I’ve already watched and loved some of the movies it recommended to me (Shutter Island, The Prestige, District 9, The Dark Knight Rises, Aliens, The Martian, and Avatar).

Again, keep in mind that your DataFrame will almost certainly differ from mine, even if you input the same favorite movies as I did. Don’t worry about this; many factors can play a role. Instead, focus on why we did what we did and on the concepts that were taught and used. This is where true learning develops and progresses.

Future Improvements and Final Observations

Congratulations on creating your recommendation engine!

Like many things in life, there are countless improvements that we can make. For instance, even though I only inputted sci-fi and action-adventure movies, I noticed a couple of comedy and romance movies among this system’s suggestions. Sci-fi and action-adventure seem like the opposite of comedy and romance, so I could perhaps add a step that looks at each movie’s genres and keeps only movies closely related to sci-fi and action-adventure (or whatever genres your inputted movies fall under). This could be done by splitting the genres each movie falls under into a list (Drama, Mystery, and War, for example), then looping through each movie’s genre list and keeping the sci-fi, action-adventure, and related genres.
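A rough sketch of that genre-filter idea, using MovieLens’s pipe-separated genres format (the titles and genre set below are made up for illustration):

```python
import pandas as pd

# Stand-in for the final `output` DataFrame with its genres column.
output = pd.DataFrame({
    "title": ["Space Movie", "Rom-Com Movie"],
    "genres": ["Sci-Fi|Thriller", "Comedy|Romance"],
})

# Keep only rows that share at least one genre with our inputted favorites.
keep = {"Sci-Fi", "Action", "Adventure"}
has_genre = output["genres"].str.split("|").apply(lambda gs: bool(keep & set(gs)))
filtered = output[has_genre]
print(filtered["title"].tolist())  # → ['Space Movie']
```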

Furthermore, we could use other methods of calculating correlation, such as cosine similarity. If you’re interested in learning more about cosine similarity and its role in a recommendation system, there’s a very detailed article that goes through the process of creating a movie recommendation system using cosine similarity here.

Last but not least, we have to review our algorithm’s runtime, efficiency, and effectiveness on large datasets. It takes a lot of time, memory, and compute to calculate Pearson correlation coefficients for every movie in our dataset (keep in mind our dataset has tens of thousands of movies), which is the major trade-off of this recommendation system. Throughout this guide, there were multiple times where we had to wait a decent amount of time for our computer to finish working on our large datasets (especially when forming new DataFrames and calculating correlations).

So, this sort of movie recommendation engine wouldn’t work well in the real world and quite frankly, isn’t practical. However, our movie recommendation system does give us a glimpse of what data science is all about!

Thanks for making it this far! If you have any other questions or concerns, feel free to leave a comment below.
