code documentation

saheelahmed2 · Jan 18, 2019 · c6138eb
1 parent 95f11ba
commit c6138eb
Showing 10 changed files with 510 additions and 3 deletions.
diff --git a/.ipynb_checkpoints/notebook-checkpoint.ipynb b/.ipynb_checkpoints/notebook-checkpoint.ipynb
@@ -0,0 +1,338 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Movie Recommendation System\n",
+    "&nbsp;\n",
+    "\n",
+    "* At some point, each of us must have wondered where the recommendations from Netflix, Amazon, and Google come from. We often rate products on the internet, and the preferences we express and the data we share (explicitly or not) are used by recommender systems to generate those recommendations. The two main types of recommender systems are collaborative filters and content-based filters. The names are fairly self-explanatory, but let’s look at a couple of examples to better understand the differences between them. I will use movies as an example (because if I could, I would be watching movies and TV shows all the time), but keep in mind that the same process can be applied to any kind of product you watch, listen to, or buy.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img src = 'images/movie.png'>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Why a Recommendation System?\n",
+    "\n",
+    "&nbsp;\n",
+    "&nbsp;\n",
+    "\n",
+    "* Here's the list of advantages:\n",
+    "    * Increase in sales due to better personalized offers.\n",
+    "    * Enhanced customer experience.\n",
+    "    * More time spent on the platform.\n",
+    "    * Customer retention.\n",
+    "\n",
+    "&nbsp;\n",
+    "&nbsp;\n",
+    "\n",
+    "* A recent study by Epsilon found that 90% of consumers find personalization appealing, and 80% say they are more likely to do business with a company that offers personalized experiences.\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "* The study also found that these consumers are 10x more likely to become VIP customers, who make more than 15 purchases per year.\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "* The moral of the story? If you’re interested in cross-selling or serving personalized offers, a recommendation system is right for you."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Recommendation at Netflix.com\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "<img src='images/nf.jpg'>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Types of Recommendation Systems\n",
+    "\n",
+    "&nbsp;\n",
+    "&nbsp;\n",
+    "\n",
+    "* Content Based Filtering\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "* Collaborative Filtering\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Content Based Filtering\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "<img src = 'images/cn.png' style=\"width:500px\">\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "\n",
+    "* Content-based filtering uses the characteristics or properties of an item to serve recommendations. This characteristic information includes:\n",
+    "\n",
+    "    * Characteristics of Items (Keywords and Attributes)\n",
+    "    * Characteristics of Users (Profile Information)\n",
+    "    \n",
+    "&nbsp;\n",
+    "\n",
+    "* Let’s use a movie recommendation system as an example. Characteristics for the item Harry Potter and the Sorcerer’s Stone might include:\n",
+    "\n",
+    "    * Director Name – Chris Columbus\n",
+    "    * Genres – Adventure, Fantasy, Family (IMDB)\n",
+    "    * Stars – Daniel Radcliffe, Rupert Grint, Emma Watson\n",
+    "    \n",
+    "&nbsp;\n",
+    "\n",
+    "<img src = 'images/content-based-filtering-harry-potter.png' style=\"width:800px\" >\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "* A content based recommender system can now serve the user:\n",
+    "    * More Harry Potter Movies\n",
+    "    * More Adventure, Family, or Fantasy Movies\n",
+    "    * More Chris Columbus Movies\n",
+    "    * More Daniel Radcliffe Movies\n",
+    "    \n",
+    "&nbsp;\n",
+    "\n",
+    "* Of course, the list is not exhaustive. Once the user makes choices, the recommender system can serve more targeted results.\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "* Let’s say the user chooses Swiss Army Man next. It’s reasonable to assume the user likes movies starring Daniel Radcliffe, so the system tracks these choices and begins to recommend Daniel Radcliffe films.\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "* The system may also show the user more Harry Potter movies. The hypothesis is that if a user liked an item in the past, they might like similar items in the future.\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "* Content-based filtering systems can also recommend items based on user profiles. You can build user profiles from historical actions, or ask users upfront about their interests and preferences.\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "* Pro Tip: A hybrid recommender system combines elements of content-based and collaborative filtering. In general, that means the strengths of one approach can remedy the pitfalls of the other.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "* In this project I implement a basic content-based recommendation system using the [MovieLens](https://grouplens.org/datasets/movielens/100k/) 100k dataset.\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "#### About MovieLens Dataset\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "* MovieLens is one of the most common publicly available datasets for building recommender systems. The version used here (100k) contains 100,000 ratings of 1,682 movies made by 943 MovieLens users.\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "* The data was collected by GroupLens researchers through the MovieLens website between September 1997 and April 1998, and this 100k version was released in April 1998. Every user included has rated at least 20 movies. Each user is represented by an id, along with simple demographic information (age, sex, occupation, and zip code)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# importing libraries \n",
+    "import pandas as pd\n",
+    "import numpy as np"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# loading user data from the extracted ml-100k folder (pipe-separated, no header row)\n",
+    "user_cols = ['user_id','age','sex','occupation','zip_code']\n",
+    "users = pd.read_csv('ml-100k/u.user', sep='|', names = user_cols, encoding = 'latin-1')\n",
+    "\n",
+    "# loading movie ratings (tab-separated, no header row)\n",
+    "ratings_cols = ['user_id','movie_id','rating','unix_timestamp']\n",
+    "ratings = pd.read_csv('ml-100k/u.data', sep='\\t', names = ratings_cols, encoding = 'latin-1')\n",
+    "\n",
+    "# loading movie metadata: only the first 5 columns of u.item\n",
+    "movies_cols = ['movie_id','title','release_date','video_release_date','imdb_url']\n",
+    "movies = pd.read_csv('ml-100k/u.item', sep='|', names = movies_cols,usecols = range(5),\n",
+    "                     encoding = 'latin-1')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# loading the 19 binary genre indicator columns (columns 5 to 23 of u.item)\n",
+    "genres_list = ['unknown','Action','Adventure','Animation','Children','Comedy','Crime',\n",
+    "               'Documentary','Drama','Fantasy','Film-Noir','Horror','Musical','Mystery',\n",
+    "               'Romance','Sci-Fi','Thriller','War','Western']\n",
+    "genre = pd.read_csv('ml-100k/u.item', sep='|',names = genres_list,usecols = range(5,24),encoding = 'latin-1')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# dropping redundant columns\n",
+    "movies.drop(['video_release_date','imdb_url'],inplace=True,axis = 1)\n",
+    "ratings.drop('unix_timestamp',axis = 1,inplace=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# merge movies, ratings and users into one dataset on their shared key columns\n",
+    "dataset = pd.merge(pd.merge(movies, ratings, on='movie_id'), users, on='user_id')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# top 20 most rated movies (by number of ratings received)\n",
+    "dataset.groupby('title')['rating'].count().sort_values(ascending=False).head(20)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Cosine Similarity\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "* We are all familiar with vectors: they can be 2D, 3D, or any number of dimensions. Let’s think in 2D for a moment, because it’s easier to picture, and refresh the concept of the dot product first. The dot product of two vectors equals the length of the projection of one onto the other, multiplied by the magnitude of the other. Therefore, the dot product of a vector with itself equals its squared magnitude, while if the two vectors are perpendicular (i.e. neither has a component along the other), the dot product is zero. In general, for n-dimensional vectors, the dot product can be calculated as shown below.\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "<img src ='images/cos.png' >\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "* The dot product is important when defining similarity, as the two are directly connected. The cosine similarity between two vectors u and v is the ratio between their dot product and the product of their magnitudes.\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "<img src ='images/cos2.png' >\n",
+    "\n",
+    "&nbsp;\n",
+    "\n",
+    "* By this definition, the similarity equals 1 if the two vectors point in the same direction (for example, if they are identical) and 0 if they are orthogonal. In general, cosine similarity ranges from -1 to 1, but for vectors with non-negative components, such as the 0/1 genre indicators used here, it is bounded between 0 and 1 and tells us how similar the two vectors are.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Total number of movies per genre\n",
+    "genre.sum().sort_values(ascending=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Applying cosine similarity to the genre dataset, which is already a binary (0/1) indicator matrix\n",
+    "from sklearn.metrics.pairwise import cosine_similarity\n",
+    "cosine_sim = cosine_similarity(genre)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Let's generate recommendations based on the similarity scores of the movies\n",
+    "\n",
+    "# first create a series of titles\n",
+    "titles = movies['title']\n",
+    "\n",
+    "# map each movie title (key) to its row index (value)\n",
+    "indices = pd.Series(movies.index,index=movies['title'])\n",
+    "\n",
+    "# creating the recommendation function\n",
+    "def movie_recommendation(title):\n",
+    "    # get the index of the received title\n",
+    "    index = indices[title]\n",
+    "    # pair every movie index with its similarity to the movie at that index\n",
+    "    sim_scores = list(enumerate(cosine_sim[index]))\n",
+    "    # sort the scores in descending order\n",
+    "    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)\n",
+    "    # keep the top 20 scores, skipping the first entry (normally the movie itself)\n",
+    "    sim_scores = sim_scores[1:21]\n",
+    "    # collect the indices of those 20 movies\n",
+    "    movie_indices = [i[0] for i in sim_scores]\n",
+    "    # return the titles of the movies at those indices\n",
+    "    return titles.iloc[movie_indices]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "movie_recommendation('Toy Story (1995)').head(20)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 2",
+   "language": "python",
+   "name": "python2"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.15rc1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/images/cn.png b/images/cn.png
diff --git a/images/coll.png b/images/coll.png
diff --git a/images/content-based-filtering-harry-potter.png b/images/content-based-filtering-harry-potter.png
diff --git a/images/cos.png b/images/cos.png
diff --git a/images/cos2.png b/images/cos2.png
diff --git a/images/movie.png b/images/movie.png
diff --git a/images/nf.jpg b/images/nf.jpg
diff --git a/images/nf2.jpg b/images/nf2.jpg