🎬 CINEMATCH

A Netflix-Style Movie Recommender System

Author: Jay Dayal Guwalani | Course: MSML602 - Data Science | University of Maryland

💡 Tip: Posters load automatically on GitHub Pages. Locally you will see placeholders.

1. Introduction and Motivation

1.1 The Problem of Choice Overload

In today's digital age, we face an unprecedented abundance of content. Netflix alone hosts over 15,000 titles, Amazon Prime Video offers more than 24,000 movies and shows, and new content is added daily. This phenomenon, known as choice overload or the paradox of choice (a term coined by psychologist Barry Schwartz in his 2004 book The Paradox of Choice), can decrease user satisfaction and engagement: when faced with too many options, users experience decision fatigue, leading to frustration and abandonment.

Recommendation systems solve this critical problem by filtering content based on user preferences, viewing history, and content similarity. These systems act as intelligent curators, presenting users with personalized suggestions that match their tastes. According to a McKinsey report, 35% of Amazon's revenue comes from its recommendation engine, and Netflix estimates that its recommender saves the company $1 billion annually by reducing subscriber churn.

1.2 Why This Matters for Data Science

Recommendation systems represent one of the most impactful applications of data science in the real world. They combine multiple disciplines including natural language processing, machine learning, information retrieval, and user behavior analysis. The famous Netflix Prize, a machine learning competition held from 2006 to 2009, offered $1 million to anyone who could improve Netflix's recommendation algorithm by 10%. This competition attracted thousands of teams worldwide and led to significant advances in collaborative filtering techniques.

1.3 Project Overview

In this tutorial, we build a content-based movie recommender system that mimics the "Because you watched..." feature found on Netflix. We walk through the complete data science pipeline: data curation, exploratory analysis, hypothesis testing, feature engineering, and machine learning implementation.

2. Data Curation and Parsing

This project uses the TMDB 5000 Movie Dataset from Kaggle, containing rich metadata about approximately 5,000 movies including cast, crew, genres, keywords, and plot summaries.

import pandas as pd

# Load the datasets
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

# Merge on title (a few duplicated titles inflate the row count slightly)
movies = movies.merge(credits, on='title')
movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]

print(f"Dataset shape: {movies.shape}")  # Output: (4809, 7)

2.1 Data Quality Assessment

Before proceeding with analysis, we assessed data quality by checking for missing values, duplicates, and inconsistencies. Because plot overviews are essential to our text-based similarity approach, movies without one were removed. The merge on title adds a handful of duplicate-title rows (4,809 in total), so after dropping the three movies with missing overviews we retained 4,806 movies for analysis.
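A minimal sketch of this cleaning step, applied to the merged dataframe from above:

# Drop the few movies with missing overviews, plus any exact duplicate rows
movies.dropna(subset=['overview'], inplace=True)
movies.drop_duplicates(inplace=True)
print(f"Movies retained: {len(movies)}")  # Output: 4806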

3. Exploratory Data Analysis

Our EDA revealed several key insights about the movie dataset; a short sketch for reproducing these statistics appears after the list:

  • Genre Distribution: Drama is the most common genre (2,297 movies), followed by Comedy (1,722) and Thriller (1,274). Action and Romance round out the top five.
  • Rating Distribution: Movie ratings follow a roughly normal distribution with mean 6.1 and standard deviation 1.2. Very few movies have ratings below 3.0 or above 9.0.
  • Overview Length: Plot summaries average 35 words, with a median of 32 words. This provides sufficient text for meaningful similarity calculations.
  • Popularity Correlation: Strong positive correlation (r=0.78) exists between popularity scores and vote counts, validating that popular movies receive more engagement.
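The sketch below reproduces these statistics. It assumes the raw tmdb_5000_movies.csv, which still contains the vote_average, popularity, and vote_count columns dropped in Section 2:

import ast
import pandas as pd

raw = pd.read_csv('tmdb_5000_movies.csv')

# Genre frequency: parse the JSON-like genres column and count each name
genre_counts = (raw['genres']
                .apply(lambda g: [d['name'] for d in ast.literal_eval(g)])
                .explode()
                .value_counts())
print(genre_counts.head())  # Drama, Comedy, Thriller, ...

# Rating distribution (mean ~6.1, std ~1.2)
print(raw['vote_average'].describe())

# Overview length in words (mean ~35, median ~32)
lengths = raw['overview'].dropna().str.split().str.len()
print(lengths.mean(), lengths.median())

# Popularity vs. vote count (r ~0.78)
print(raw['popularity'].corr(raw['vote_count']))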

4. Hypothesis Testing

We validated our assumptions about the data using statistical hypothesis testing to ensure our approach is grounded in evidence.

Hypothesis 1: Overview Length vs. Rating

H0: There is no significant difference in overview length between high-rated and low-rated movies.

H1: High-rated movies have significantly different overview lengths than low-rated movies.

Method: Independent samples t-test

Result: t=2.34, p=0.019. We reject H0 at alpha=0.05, indicating high-rated movies tend to have slightly longer, more detailed plot descriptions.
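A minimal sketch of this test with SciPy. Splitting into high- and low-rated groups at the median rating is an assumption for illustration; the report does not state the exact threshold used:

import pandas as pd
from scipy import stats

df = pd.read_csv('tmdb_5000_movies.csv').dropna(subset=['overview'])
lengths = df['overview'].str.split().str.len()

# Split at the median rating (assumed threshold)
median_rating = df['vote_average'].median()
high = lengths[df['vote_average'] >= median_rating]
low = lengths[df['vote_average'] < median_rating]

t_stat, p_value = stats.ttest_ind(high, low)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")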

Hypothesis 2: Popularity vs. Vote Count Correlation

H0: There is no correlation between popularity and vote count

H1: There is a significant correlation between popularity and vote count

Method: Pearson correlation coefficient

Result: r=0.78, p less than 0.001. Strong positive correlation confirmed, validating that our popularity metric is meaningful.
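The corresponding SciPy call returns both the coefficient and its p-value in one line:

import pandas as pd
from scipy import stats

df = pd.read_csv('tmdb_5000_movies.csv')
r, p_value = stats.pearsonr(df['popularity'], df['vote_count'])
print(f"r = {r:.2f}, p = {p_value:.3g}")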

5. Feature Engineering

The core of our recommendation system is creating a meaningful numerical representation of each movie. We combine multiple text features into a single tags field that captures the essence of each movie:

  • Overview: Plot summary text, tokenized into individual words
  • Genres: Action, Comedy, Drama, Thriller, etc.
  • Keywords: Thematic tags from TMDB (e.g., time travel, dystopia)
  • Cast: Top 3 billed actors for each film
  • Crew: Director name (most influential crew member)
import ast

# Parse the JSON-like string columns into lists of names
def convert(obj):
    return [item['name'] for item in ast.literal_eval(obj)]

def convert_top3(obj):
    return [item['name'] for item in ast.literal_eval(obj)[:3]]

def fetch_director(obj):
    for item in ast.literal_eval(obj):
        if item['job'] == 'Director':
            return [item['name']]
    return []

movies['genres'] = movies['genres'].apply(convert)
movies['keywords'] = movies['keywords'].apply(convert)
movies['cast'] = movies['cast'].apply(convert_top3)
movies['crew'] = movies['crew'].apply(fetch_director)

# Tokenize the overview so it can be concatenated with the list columns
movies['overview'] = movies['overview'].apply(lambda x: x.split())

# Collapse multi-word names into single tokens (e.g., 'Christopher Nolan'
# -> 'ChristopherNolan') so first names alone do not create spurious matches
for col in ['genres', 'keywords', 'cast', 'crew']:
    movies[col] = movies[col].apply(lambda names: [n.replace(' ', '') for n in names])

# Combine all features into tags
movies['tags'] = movies['overview'] + movies['genres'] + \
    movies['keywords'] + movies['cast'] + movies['crew']

# Keep only what the recommender needs; reset the index so that row
# positions line up with the similarity matrix later on
new_df = movies[['movie_id', 'title', 'tags']].reset_index(drop=True)

# Convert each tag list to a lowercase string
new_df['tags'] = new_df['tags'].apply(lambda x: " ".join(x).lower())

6. Machine Learning: Text Vectorization

We use CountVectorizer from scikit-learn to convert text into a bag-of-words representation. Each movie becomes a high-dimensional vector where each dimension represents the count of a specific word from our vocabulary.

from sklearn.feature_extraction.text import CountVectorizer

# Create a vectorizer limited to the 5,000 most frequent words
cv = CountVectorizer(max_features=5000, stop_words='english')
vectors = cv.fit_transform(new_df['tags']).toarray()

print(f"Vector matrix shape: {vectors.shape}")  # Output: (4806, 5000)
# Each movie is now a 5000-dimensional vector!

7. Cosine Similarity

Cosine similarity measures the angle between two vectors. For non-negative bag-of-words vectors like ours, it ranges from 0 (no shared terms) to 1 (identical direction). It is ideal for text comparison because it depends on the direction of a vector rather than its magnitude, making it robust to differences in document length.

cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)

The dot product A · B measures how much the vectors point in the same direction, while the denominator normalizes by the vector magnitudes.

from sklearn.metrics.pairwise import cosine_similarity

# Compute the full pairwise similarity matrix
similarity = cosine_similarity(vectors)

print(f"Similarity matrix shape: {similarity.shape}")  # Output: (4806, 4806)
# This is a 4806 x 4806 matrix with ~23 million pairwise similarities!
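To connect the formula to the code, here is a quick sanity check (indices 0 and 1 are arbitrary example movies): computing the similarity between two vectors by hand with numpy should match the corresponding entry of the scikit-learn matrix.

import numpy as np

# Manual cosine similarity between the first two movie vectors
a, b = vectors[0], vectors[1]
manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Should agree with the precomputed matrix entry
print(np.isclose(manual, similarity[0, 1]))  # True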

8. Recommendation Function

The recommendation function finds the most similar movies to a given input by looking up its row in the similarity matrix and sorting by similarity score.

def recommend(movie, n=5):
    # Find the index of the query movie
    movie_index = new_df[new_df['title'] == movie].index[0]
    # Get its row of similarity scores against every other movie
    distances = similarity[movie_index]
    # Sort by score and take the top n (position 0 is the movie itself)
    movies_list = sorted(
        list(enumerate(distances)),
        reverse=True,
        key=lambda x: x[1]
    )[1:n+1]
    # Return titles with similarity expressed as a percentage
    return [(new_df.iloc[i].title, round(score * 100, 1))
            for i, score in movies_list]

# Example usage
recommend('Batman Begins', n=5)
# Output: [('The Dark Knight', 50.2), ('Batman Returns', 40.9), ...]

9. Key Insights and Results

  • Feature Combination Matters: Combining genres, cast, keywords, and overview creates much richer representations than any single feature alone. Movies with the same director AND similar genres show highest similarity.
  • Text Preprocessing is Critical: Removing spaces from multi-word names (e.g., ChristopherNolan) treats them as single tokens, dramatically improving matching for cast and crew.
  • Cosine Similarity is Robust: It handles varying overview lengths gracefully; a 100-word description can still match well with a 30-word one if they share key terms.
  • Sparsity is Expected: Average pairwise similarity is only ~0.03, which makes sense given the diversity of 4,806 movies across all genres and eras.
  • Franchise Detection: The algorithm excels at identifying movie franchises and sequels (e.g., all Batman movies cluster together with 30-50% similarity).

10. Limitations and Future Work

Current Limitations:

  • Cold Start Problem: New movies without metadata cannot receive or generate recommendations
  • No Personalization: The system recommends based on content, not individual user preferences
  • Popularity Bias: Well-documented movies have richer tags and may dominate recommendations
  • Language Limitation: English stop words removal may not work well for non-English content

Future Enhancements:

  • Implement collaborative filtering using user ratings for personalization
  • Use TF-IDF weighting instead of raw counts to reduce common-word influence (see the sketch after this list)
  • Explore neural embeddings (Word2Vec, BERT) for semantic understanding
  • Add diversity constraints to avoid recommending only sequels
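Of these, the TF-IDF swap is nearly drop-in with scikit-learn; a minimal sketch, reusing new_df from Section 5:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same interface as CountVectorizer, but down-weights words common to many movies
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
tfidf_vectors = tfidf.fit_transform(new_df['tags'])

# cosine_similarity accepts the sparse matrix directly; no .toarray() needed
similarity_tfidf = cosine_similarity(tfidf_vectors)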


Technologies Used

  • Python 3.10 - Primary programming language
  • Pandas and NumPy - Data manipulation and numerical computing
  • Scikit-learn - Machine learning (CountVectorizer, cosine_similarity)
  • Matplotlib and Seaborn - Data visualization
  • SciPy - Statistical hypothesis testing
  • TMDB API - Real-time movie poster retrieval
  • HTML/CSS/JavaScript - Interactive web interface

Acknowledgments

This project was developed as part of MSML602 - Principles of Data Science at the University of Maryland. Special thanks to Professor Mohammad Nayeem Teli for guidance and the course materials that made this project possible.