A Netflix-Style Movie Recommender System
Author: Jay Dayal Guwalani | Course: MSML602 - Data Science | University of Maryland
In today's digital age, we face an unprecedented abundance of content. Netflix alone hosts over 15,000 titles, Amazon Prime Video offers more than 24,000 movies and shows, and new content is added daily. This phenomenon, known as choice overload or the paradox of choice (a term coined by psychologist Barry Schwartz in his 2004 book of the same name), can actually decrease user satisfaction and engagement. When faced with too many options, users often experience decision fatigue, leading to frustration and abandonment.
Recommendation systems solve this critical problem by filtering content based on user preferences, viewing history, and content similarity. These systems act as intelligent curators, presenting users with personalized suggestions that match their tastes. According to a McKinsey report, 35% of Amazon's revenue comes from its recommendation engine, and Netflix estimates that its recommender saves the company $1 billion annually by reducing subscriber churn.
Recommendation systems represent one of the most impactful applications of data science in the real world. They combine multiple disciplines including natural language processing, machine learning, information retrieval, and user behavior analysis. The famous Netflix Prize, a machine learning competition held from 2006 to 2009, offered $1 million to anyone who could improve Netflix's recommendation algorithm by 10%. This competition attracted thousands of teams worldwide and led to significant advances in collaborative filtering techniques.
In this tutorial, we build a content-based movie recommender system that mimics the "Because you watched..." feature found on Netflix. We walk through the complete data science pipeline: data curation, exploratory analysis, hypothesis testing, feature engineering, and machine learning implementation.
This project uses the TMDB 5000 Movie Dataset from Kaggle, containing rich metadata about approximately 5,000 movies including cast, crew, genres, keywords, and plot summaries.
Before proceeding with analysis, we assessed data quality by checking for missing values, duplicates, and inconsistencies. Movies without plot overviews were removed, since overviews are essential to our text-based similarity approach. After cleaning, we retained 4,806 movies for analysis.
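A minimal sketch of this cleaning step with pandas, using a toy frame in place of the real TMDB data (the real frame would come from reading the Kaggle CSVs):

```python
import pandas as pd

# Toy stand-in for the merged TMDB data; column names are illustrative.
movies = pd.DataFrame({
    "movie_id": [1, 2, 2, 3, 4],
    "title": ["Avatar", "Spectre", "Spectre", "Up", "Brave"],
    "overview": ["A paraplegic Marine...", "A cryptic message...",
                 "A cryptic message...", None, "Determined to make..."],
})

# Drop exact duplicates, then drop movies with no plot overview,
# since the overview drives our text-based similarity.
clean = (movies.drop_duplicates(subset="movie_id")
               .dropna(subset=["overview"])
               .reset_index(drop=True))
print(len(clean))  # 3 movies survive the cleaning
```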
Our EDA revealed several key insights about the movie dataset:
We validated our assumptions about the data using statistical hypothesis testing to ensure our approach is grounded in evidence.
H0: There is no significant difference in overview length between high-rated and low-rated movies.
H1: High-rated movies have significantly different overview lengths than low-rated movies.
Method: Independent samples t-test
Result: t=2.34, p=0.019. We reject H0 at alpha=0.05, indicating high-rated movies tend to have slightly longer, more detailed plot descriptions.
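As an illustration of the method only (not a reproduction of the reported result), an independent samples t-test with SciPy on synthetic overview lengths looks like this; the group parameters below are made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic overview lengths (in words) for the two rating groups;
# the real values come from the cleaned TMDB frame.
high_rated = rng.normal(loc=55, scale=12, size=200)
low_rated = rng.normal(loc=50, scale=12, size=200)

# Two-sided independent samples t-test, as in the H0 vs H1 setup above.
t_stat, p_value = stats.ttest_ind(high_rated, low_rated)
print(round(t_stat, 2), p_value < 0.05)
```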
H0: There is no correlation between popularity and vote count.
H1: There is a significant correlation between popularity and vote count.
Method: Pearson correlation coefficient
Result: r=0.78, p < 0.001. Strong positive correlation confirmed, validating that our popularity metric is meaningful.
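Again as an illustration of the method rather than the reported numbers, SciPy's Pearson correlation on synthetic popularity and vote-count data with a built-in linear relationship:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic columns; the real ones come from the TMDB frame.
popularity = rng.uniform(1, 100, size=500)
vote_count = 30 * popularity + rng.normal(0, 400, size=500)

# Pearson correlation coefficient and its p-value.
r, p = stats.pearsonr(popularity, vote_count)
print(round(r, 2), p < 0.001)
```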
The core of our recommendation system is creating a meaningful numerical representation of each movie. We combine multiple text features into a single tags field that captures the essence of each movie:
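A minimal sketch of building the combined tags field, assuming the genres, keywords, and cast columns have already been parsed from TMDB's JSON strings into Python lists (column names and values here are illustrative):

```python
import pandas as pd

movies = pd.DataFrame({
    "title": ["Avatar", "Spectre"],
    "overview": ["A paraplegic Marine is sent to Pandora",
                 "A cryptic message sends Bond on a trail"],
    "genres": [["Action", "Adventure", "ScienceFiction"],
               ["Action", "Thriller"]],
    "keywords": [["space", "alien"], ["spy", "mi6"]],
    "cast": [["SamWorthington"], ["DanielCraig"]],
})

# Concatenate the overview words with the list-based metadata into a
# single lowercase "tags" string per movie.
movies["tags"] = (
    movies["overview"].str.split()
    + movies["genres"] + movies["keywords"] + movies["cast"]
).apply(lambda words: " ".join(words).lower())

print(movies["tags"].iloc[0])
```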
We use CountVectorizer from scikit-learn to convert text into a bag-of-words representation. Each movie becomes a high-dimensional vector where each dimension represents the count of a specific word from our vocabulary.
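A sketch of the vectorization step on a toy tags corpus; the max_features value is illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer

tags = [
    "space alien marine action adventure",
    "spy bond action thriller",
    "space station alien thriller",
]

# Bag-of-words: cap the vocabulary size and drop English stop words.
cv = CountVectorizer(max_features=5000, stop_words="english")
vectors = cv.fit_transform(tags).toarray()

print(vectors.shape)  # (number of movies, vocabulary size)
```

Each row is one movie's word-count vector over the shared vocabulary; rows with overlapping vocabulary will later yield high cosine similarity.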
Cosine similarity measures the cosine of the angle between two vectors; for non-negative bag-of-words count vectors it ranges from 0 (no shared terms) to 1 (identical direction). It is ideal for text comparison because it focuses on the direction of vectors rather than their magnitude, making it robust to differences in document length.
For two vectors A and B, cosine similarity = (A . B) / (||A|| ||B||). The dot product A . B measures how much the vectors point in the same direction, while the denominator normalizes by the vector magnitudes.
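The formula can be checked directly: computing it by hand with NumPy matches scikit-learn's cosine_similarity on a pair of toy count vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two toy bag-of-words count vectors.
A = np.array([2, 1, 0, 1])
B = np.array([1, 1, 1, 0])

# cos(theta) = (A . B) / (||A|| * ||B||)
manual = A.dot(B) / (np.linalg.norm(A) * np.linalg.norm(B))
library = cosine_similarity(A.reshape(1, -1), B.reshape(1, -1))[0, 0]

print(round(manual, 4), round(library, 4))  # both ~0.7071
```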
The recommendation function finds the most similar movies to a given input by looking up its row in the similarity matrix and sorting by similarity score.
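Putting the pieces together, here is a sketch of the lookup-and-sort recommendation function over a toy tags corpus (the titles, tags, and helper name are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ["Avatar", "Aliens", "Spectre", "Skyfall"]
tags = [
    "space alien marine pandora action",
    "space alien marine colony action",
    "spy bond mi6 action thriller",
    "spy bond mi6 london thriller",
]

cv = CountVectorizer()
vectors = cv.fit_transform(tags)
similarity = cosine_similarity(vectors)  # n_movies x n_movies matrix

def recommend(title, top_n=2):
    """Return the top_n movies most similar to `title`."""
    idx = titles.index(title)
    # Sort all movies by similarity to the query movie's row.
    scores = sorted(enumerate(similarity[idx]),
                    key=lambda pair: pair[1], reverse=True)
    # Skip the query movie itself, then take the top_n titles.
    return [titles[i] for i, _ in scores if i != idx][:top_n]

print(recommend("Avatar"))  # -> ['Aliens', 'Spectre']
```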
Current Limitations:
Future Enhancements:
This project was developed as part of MSML602 - Principles of Data Science at the University of Maryland. Special thanks to Professor Mohammad Nayeem Teli for guidance and the course materials that made this project possible.