This project implements a movie recommender system using TMDB movie data and includes an exploratory data analysis (EDA) of the dataset.
The movie data used for this project was downloaded from the TMDB Movie Metadata dataset on Kaggle.
This project also includes a deployed model on Hugging Face: Movie Recommender
This recommender system uses Python libraries. To run the application locally, you'll need to set up a virtual environment. Here are the instructions for creating a virtual environment:
- Open your terminal or command prompt.
- Install venv if you haven't already:
python -m ensurepip install venv
- Create a virtual environment named
movie_rec_env
:python -m venv movie_rec_env
- Activate the virtual environment:
- Windows:
movie_rec_env\Scripts\activate
- macOS/Linux:
source movie_rec_env/bin/activate
- Windows:
- Open your terminal or command prompt.
- Create a virtual environment named
movie_rec_env
:conda create -n movie_rec_env python=3.x
(Replace3.x
with your desired Python version) - Activate the virtual environment:
conda activate movie_rec_env
- Once your virtual environment is activated, navigate to the project directory in your terminal.
- Install the required libraries listed in
requirements.txt
:pip install -r requirements.txt
- After installing the dependencies, run the application using:
streamlit run app.py
- This will start the movie recommender system. You'll be able to interact with the application to receive movie recommendations based on your preferences.
- app.py: The main application file containing the Streamlit code for the movie recommender system.
- train.ipynb: Jupyter notebook containing the code for training the recommendation model.
- mini-projecteda.ipynb: Jupyter notebook containing the exploratory data analysis (EDA) of the TMDB movie dataset.
- movies.pkl: Pickle file containing the movies data.
- similarity.pkl: Pickle file containing the similarity matrix.
- requirements.txt: A file listing the required Python libraries.
- Loading Data: The application loads movie data and a similarity matrix from precomputed pickle files.
- User Input: Users can select a movie from a dropdown menu.
- Recommendations: Upon clicking the 'Recommend' button, the system fetches and displays the top 5 movie recommendations along with their posters using the TMDB API.
The training process is detailed in the train.ipynb
Jupyter notebook. Below is a summary of the steps involved:
- Data Loading: The TMDB movie and credits data are loaded and merged on the 'title' column.
- Data Cleaning:
- Dropping rows with missing values.
- Extracting relevant columns: movie_id, title, overview, genres, keywords, cast, crew.
- Converting genres, keywords, cast, and crew from string to list format for better manipulation.
- Data Preparation:
- Extracting the top 3 actors for the cast.
- Extracting the director from the crew.
- Converting the overview into a list of words.
- Removing spaces and converting to lowercase for uniformity.
- Feature Engineering:
- Combining overview, genres, keywords, cast, and crew into a single 'tags' column.
- Converting 'tags' into a single string.
- Applying stemming to reduce words to their root form.
- Vectorization: Using CountVectorizer to convert text data into numerical data (vectors) with a maximum of 5000 features and removing English stop words.
- Similarity Calculation: Computing cosine similarity between movie vectors to measure similarity.
- Recommendation Function: Creating a function to recommend movies based on cosine similarity.
- Model Saving: Saving the processed movie data and similarity matrix using pickle.
- Ensure you have Jupyter Notebook installed and your virtual environment activated.
- Open the
train.ipynb
file in Jupyter Notebook. - Run the notebook cells to train the model and save the outputs.
The EDA process is detailed in the mini-projecteda.ipynb
Jupyter notebook. Below is a summary of the steps involved:
- Data Loading: The TMDB movie and credits data are loaded into Pandas DataFrames.
- Data Examination:
- Viewing the first few rows of the datasets.
- Checking for missing values and duplicates.
- Descriptive statistics of the datasets.
- Data Cleaning:
- Dropping unnecessary columns.
- Handling missing values and duplicates.
- Converting data types for consistency.
- Feature Engineering:
- Extracting lead actor and crew size.
- Binning budget values.
- Calculating profit from revenue and budget.
- Extracting genre and production company names.
- Data Visualization:
- Bar plots for top 10 movies by cast and crew size.
- Distribution plots for budgets and revenues.
- Scatter plots for profit vs. budget.
- Line plots for the number of movies released per year.
- Heatmap for correlation matrix of numerical columns.
- Pair plots for numerical columns.
- Insights:
- Distribution of movie genres and production companies.
- Analysis of top revenue and profit-generating movies.
- Analysis of vote count and popularity by production company.
- Ensure you have Jupyter Notebook installed and your virtual environment activated.
- Open the
mini-projecteda.ipynb
file in Jupyter Notebook. - Run the notebook cells to perform the exploratory data analysis and visualize the insights.
Make sure to replace the placeholder api_key
in app.py
with your actual TMDB API key.
Enjoy your movie recommendations and data exploration!