- Project Overview
- About DataSet
- Technologies
- Extracting, Transforming and Loading process
- Movie Recommendation using KNN
- Observations
- Reference
Knowledge-based, content-based, and collaborative recommender systems are built on the MovieLens dataset, which contains 100836 movie ratings. These recommender systems were built using Pandas operations and by fitting KNN (K-Nearest Neighbours), NCF (Neural Collaborative Filtering), and deep learning models that use a neural network (NN) architecture. They suggest movies to users based on similar users and answer queries specific to genre, user, movie, rating, and popularity.
This project uses the MovieLens dataset, a ratings dataset collected from the MovieLens website over a period of time. It is a stable benchmark dataset. Further information regarding this dataset can be found here.
The dataset consists of:
- 100836 ratings from 610 users on 9724 movies.
- Each user has rated at least 20 movies.
- Simple fields such as movie ID, genre, and movie rating.
What to watch? With so many new and old movies out there we are spoiled for choice of what movies to watch next. But that choice leaves us spending more time looking for the next movie than actually watching it! It’s also hard to tell if we will even like the movie we settle on with just a blurb and a poster. Our goal is to fix this great problem of our time by creating a machine learning model that will help users find what movies to watch next based on a user's preferences.
To get a local copy up and running, follow these steps:
- Clone the repository using a local terminal (for example, Git Bash).
- Open the Jupyter Notebook in which the preprocessing steps are performed.
- sqlalchemy (SQL database)
  - conda install -c anaconda sqlalchemy
- config
  - pip install config
- matplotlib.pyplot
  - pip install matplotlib
- tensorflow
  - pip install tensorflow
- numpy
  - pip install numpy
- Scikit-learn machine learning library
The data was provided to us as a CSV. We checked the dataset for null values and found none. We also verified that the datatypes matched the variable values described above. The data checked out in all these areas, so no additional transformation was required.
Then, we created an SQL database to store the data, using SQL and Pandas.
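The null-value and datatype checks described above could look roughly like the sketch below. It is only an illustration: the inline sample rows, the column names (userId, movieId, rating, timestamp) and the expected datatypes are assumptions, not taken from the project notebook.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the ratings CSV.
csv_text = """userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
2,6,2.5,964982224
"""
ratings = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column; all zeros means no nulls.
null_counts = ratings.isnull().sum()

# Assumed datatypes for each column (an assumption for this sketch).
expected_dtypes = {
    "userId": "int64",
    "movieId": "int64",
    "rating": "float64",
    "timestamp": "int64",
}

# Collect any column whose actual dtype differs from the expected one.
dtype_mismatches = {
    column: str(ratings[column].dtype)
    for column, expected in expected_dtypes.items()
    if str(ratings[column].dtype) != expected
}

print(null_counts)
print(dtype_mismatches)
```

If both checks come back clean, as they did for this dataset, the data can be loaded without further transformation.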
In this process, we extract the data into a DataFrame:
- Connect to the local database. Create a config.py file that holds your username and password, and list config.py in the .gitignore file so that the credentials stay confidential. If the credentials are not confidential, you can put them directly in the code and skip both config.py and the .gitignore entry.
- We chose SQL: first use Spark on Colab to extract and transform the data, then load it into a SQL table on your RDS account. Perform the analysis with SQL queries on RDS.
- Join the two tables in pgAdmin, or join them with Pandas and SQLAlchemy.
- The two datasets are successfully loaded into an RDS instance.
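As a rough sketch of the load-and-join steps above, the example below writes two small tables to a database and joins them both in SQL and in Pandas. To stay self-contained it uses an in-memory SQLite database as a stand-in for the RDS instance; the table names, column names, and sample rows are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical sample rows standing in for the two MovieLens tables.
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 2.5, 5.0],
})
movies = pd.DataFrame({
    "movieId": [10, 20],
    "title": ["Movie A", "Movie B"],
    "genres": ["Comedy", "Drama"],
})

# In-memory SQLite stands in for the RDS instance (an assumption).
# With a real server, the username and password would be read from
# config.py rather than written into the code.
conn = sqlite3.connect(":memory:")
ratings.to_sql("ratings", conn, index=False)
movies.to_sql("movies", conn, index=False)

# Option 1: join the two tables with a SQL query.
sql_joined = pd.read_sql(
    "SELECT r.userId, r.movieId, r.rating, m.title, m.genres "
    "FROM ratings AS r JOIN movies AS m ON r.movieId = m.movieId "
    "ORDER BY r.userId, r.movieId",
    conn,
)

# Option 2: read both tables and join them in Pandas.
pandas_joined = (
    pd.read_sql("SELECT * FROM ratings", conn)
    .merge(pd.read_sql("SELECT * FROM movies", conn), on="movieId", how="inner")
    .sort_values(["userId", "movieId"])
    .reset_index(drop=True)
)
conn.close()

print(sql_joined)
```

Both options should produce the same rows, so the choice mostly depends on whether the join is easier to maintain as a query or as notebook code.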
To run and view the KNN model, run the cells in data_etl.ipynb that follow the ETL process (after cell 12), assuming all dependencies have been installed. The data is first prepared by transforming reviews of 4 stars or higher into values of 1 and reviews lower than 4 stars into 0. Users with 10 or fewer reviews are then removed from the dataset. Next, the data is split into X (movie_id, user_id, timestamps) and y (ratings) and divided into training and testing sets. Finally, the data is scaled and the KNN model is built, reaching an accuracy of 78.8%. The training process of the KNN model is also shown as a line graph.
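The preparation and training steps described above can be sketched as follows. This is a minimal illustration on randomly generated ratings, not the notebook's code: the data, the 75/25 split, the values of k, and the variable names are all assumptions, so the accuracy it prints is unrelated to the 78.8% reported for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Randomly generated ratings standing in for the real data (an assumption).
rng = np.random.default_rng(0)
n_rows = 600
ratings = pd.DataFrame({
    "user_id": rng.integers(1, 31, size=n_rows),
    "movie_id": rng.integers(1, 101, size=n_rows),
    "rating": rng.choice(np.arange(0.5, 5.5, 0.5), size=n_rows),
    "timestamp": rng.integers(900_000_000, 1_500_000_000, size=n_rows),
})

# Reviews of 4 stars or higher become 1, lower reviews become 0.
ratings["liked"] = (ratings["rating"] >= 4.0).astype(int)

# Remove users with 10 or fewer reviews.
review_counts = ratings.groupby("user_id")["rating"].transform("count")
ratings = ratings[review_counts > 10]

# Split into features and target, then into training and testing sets.
X = ratings[["movie_id", "user_id", "timestamp"]]
y = ratings["liked"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training data only, then scale both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Record the test accuracy for several values of k; scores like these
# are one way to produce a line graph of the training process.
scores_by_k = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    scores_by_k[k] = knn.score(X_test_scaled, y_test)

test_accuracy = scores_by_k[5]
print(scores_by_k)
```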
Two more optimisations were performed on the dataset prior to building the model. The second and more successful optimisation converted reviews of 2.5 stars or more into 1 and lower reviews into 0, similar to the cleaning process of the first iteration. This made the model roughly 7 percentage points more accurate, giving a test score of 85.9%.
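The two labelling thresholds can be compared with a small sketch like the one below, using made-up ratings. It also prints the share of positive labels and the majority-class baseline for each threshold, since moving the threshold changes the class balance and the baseline is a useful reference point when comparing accuracy scores. The helper name `binarize` and the sample values are hypothetical.

```python
import pandas as pd

# Made-up ratings for illustration only.
ratings = pd.Series([0.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0], name="rating")


def binarize(values: pd.Series, threshold: float) -> pd.Series:
    """Label ratings at or above the threshold as 1 and the rest as 0."""
    return (values >= threshold).astype(int)


summary = {}
for threshold in (4.0, 2.5):
    labels = binarize(ratings, threshold)
    positive_share = float(labels.mean())
    summary[threshold] = {
        "positive_share": positive_share,
        # Accuracy of always predicting the more common class.
        "majority_baseline": max(positive_share, 1 - positive_share),
    }

print(summary)
```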
The neural network model is located in the “movie_recommendation_using_NN.ipynb” notebook. The aim was to use collaborative filtering similar to that of the Keras model provided in their documentation [2], as well as the model described in “Neural Collaborative Filtering” [1], where users are recommended content based on the preferences of similar users as well as their own preferences.
To run this file simply open “movie_recommendation_using_NN.ipynb” and run all cells (assuming all dependencies are correctly installed). This process may take some time as there are multiple iterations of the neural collaborative filtering model (NCF).
The program loads the data from the SQL database, then cleans it by creating a dataframe of only the needed columns and prepares it by enumerating it so that its indices start at 0 for the machine learning model. It also creates the x and y variables and splits the data into training and testing sets. A sequential Keras model is then built and optimized through 4 iterations, and the notebook displays a dataframe of the 10 movie recommendations for the user based on the user_id provided.
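The data preparation and the final recommendation step could be sketched as below. This is an illustration under assumptions, not the notebook's code: the sample rows, the 80/20 split, and the `recommend` helper are hypothetical, and a random placeholder scorer stands in for the trained Keras model so the example runs without TensorFlow.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for the data read from SQL.
ratings = pd.DataFrame({
    "user_id": [314, 314, 7, 7, 42],
    "movie_id": [50, 260, 50, 1196, 260],
    "rating": [4.0, 5.0, 3.0, 4.5, 2.0],
})

# Enumerate the raw IDs so that the model sees indices starting at 0.
user_ids = ratings["user_id"].unique().tolist()
movie_ids = ratings["movie_id"].unique().tolist()
user_to_index = {raw_id: index for index, raw_id in enumerate(user_ids)}
movie_to_index = {raw_id: index for index, raw_id in enumerate(movie_ids)}
ratings["user"] = ratings["user_id"].map(user_to_index)
ratings["movie"] = ratings["movie_id"].map(movie_to_index)

# Build x (user index, movie index) and y (rating).
x = ratings[["user", "movie"]].to_numpy()
y = ratings["rating"].to_numpy()

# Shuffle, then split into training and testing data (80/20 is assumed).
rng = np.random.default_rng(42)
order = rng.permutation(len(x))
split = int(0.8 * len(x))
x_train, x_test = x[order[:split]], x[order[split:]]
y_train, y_test = y[order[:split]], y[order[split:]]


def recommend(user_id, predict_scores, top_n=10):
    """Return up to top_n unseen movie IDs, ranked by predicted score."""
    seen = set(ratings.loc[ratings["user_id"] == user_id, "movie_id"])
    candidates = [movie for movie in movie_ids if movie not in seen]
    if not candidates:
        return []
    pairs = np.array(
        [[user_to_index[user_id], movie_to_index[movie]] for movie in candidates]
    )
    scores = np.asarray(predict_scores(pairs)).reshape(-1)
    ranked = np.argsort(scores)[::-1][:top_n]
    return [candidates[i] for i in ranked]


def placeholder_scores(pairs):
    """Random scores; a trained model's predict method would go here."""
    return rng.random(len(pairs))


recommendations = recommend(314, placeholder_scores)
print(recommendations)
```

With a trained model, the placeholder would be replaced by the model's prediction function, and the returned movie IDs could be joined back to the movie titles for display.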
Here is an example of movie recommendations for user 314:
[1] "Neural Collaborative Filtering", accessed on 18 January 2023 (https://arxiv.org/abs/1708.05031)
[2] "Collaborative Filtering for Movie Recommendations", accessed on 17 January 2023 (https://keras.io/examples/structured_data/collaborative_filtering_movielens/)