pull #2

Open · wants to merge 2 commits into base: master
53 changes: 53 additions & 0 deletions Book Recommendation/Python/ParseData.py
@@ -0,0 +1,53 @@
from pandas import read_csv
from os.path import join

from ChicagoBoothML_Helpy.Print import printflush


def parse_book_crossing_data(
        data_path='https://raw.githubusercontent.com/ChicagoBoothML/DATA___BookCrossing/master'):
    """Download and parse the Book-Crossing books, users & ratings CSV files."""

    # common strings denoting missing (NA) values:
    na_strings = [
        '',
        'na', 'n.a', 'n.a.',
        'nan', 'n.a.n', 'n.a.n.',
        'NA', 'N.A', 'N.A.',
        'NaN', 'N.a.N', 'N.a.N.',
        'NAN', 'N.A.N', 'N.A.N.',
        'nil', 'Nil', 'NIL',
        'null', 'Null', 'NULL']

    printflush('Parsing Books...', end=' ')
    books = read_csv(
        join(data_path, 'BX-Books.csv'),
        sep=';',
        dtype=str,
        na_values=na_strings,
        usecols=['ISBN', 'Book-Title', 'Book-Author'],
        error_bad_lines=False)   # skip malformed lines (with a warning) instead of raising
    printflush('done!')

    printflush('Parsing Users...', end=' ')
    users = read_csv(
        join(data_path, 'BX-Users.csv'),
        sep=';',
        dtype=str,
        na_values=na_strings,
        error_bad_lines=False)
    users['User-ID'] = users['User-ID'].astype(int)
    users['Age'] = users['Age'].astype(float)
    printflush('done!')

    printflush('Parsing Ratings...', end=' ')
    ratings = read_csv(
        join(data_path, 'BX-Book-Ratings.csv'),
        sep=';',
        dtype=str,
        na_values=na_strings,
        error_bad_lines=False)
    ratings['User-ID'] = ratings['User-ID'].astype(int)
    ratings['Book-Rating'] = ratings['Book-Rating'].astype(float)
    printflush('done!')

    return dict(books=books, users=users, ratings=ratings)
305 changes: 305 additions & 0 deletions Book Recommendation/R/BookRecommendation.Rmd
@@ -0,0 +1,305 @@
---
title: "Recommendation Engine Example: the Book Crossing Data Set"
author: 'Chicago Booth ML Team'
output: pdf_document
fontsize: 12pt
geometry: margin=0.6in
---


_**Note:** the current script exhausts memory even on a well-equipped laptop. With more memory, the code below would run as written. We'll update this once we've found a workaround._


This script uses the [**Book Crossing**](http://www2.informatik.uni-freiburg.de/~cziegler/BX) data set to illustrate personalized recommendation algorithms.

_**Note**: as of the date of writing, the **`recommenderlab`** package's default implementation of the Latent-Factor Collaborative Filtering / Singular Value Decomposition (SVD) method has a bug and produces unreliable results. We will not cover this method for now._


# Load Libraries & Helper Modules

```{r message=FALSE, warning=FALSE}
# load RecommenderLab package
library(recommenderlab)

# import data parser
source('https://raw.githubusercontent.com/ChicagoBoothML/MachineLearning_Fall2015/master/Programming%20Scripts/Book%20Recommendation/R/ParseData.R')
```


# Data Import & Preprocessing

```{r}
data <- parse_book_crossing_data()
books <- data$books
users <- data$users
ratings <- data$ratings
ratings[ , `:=`(user_id = factor(user_id),
                isbn = factor(isbn))]
```

Let's examine the number of ratings per user and per book:

```{r}
nb_ratings_per_user <-
dcast(ratings, user_id ~ ., fun.aggregate=length, value.var='book_rating')

nb_ratings_per_book <-
dcast(ratings, isbn ~ ., fun.aggregate=length, value.var='book_rating')
```

Each user has rated between `r formatC(min(nb_ratings_per_user$.), big.mark=',')` and `r formatC(max(nb_ratings_per_user$.), big.mark=',')` books, and each book has been rated by between `r formatC(min(nb_ratings_per_book$.), big.mark=',')` and `r formatC(max(nb_ratings_per_book$.), big.mark=',')` users.
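
As a quick optional check, a `summary()` of these counts shows how skewed the two distributions are:

```{r}
# distribution of the number of ratings per user and per book
summary(nb_ratings_per_user$.)
summary(nb_ratings_per_book$.)
```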

```{r}
min_nb_ratings_per_user <- 6
```

Let's remove the users with fewer than `r min_nb_ratings_per_user` ratings to make the recommendations more reliable:

```{r}
users_with_enough_nb_ratings <- nb_ratings_per_user[. >= min_nb_ratings_per_user, user_id]

ratings <- ratings[user_id %in% users_with_enough_nb_ratings, ]
```

Let's now convert the **`ratings`** to a RecommenderLab-format Real-Valued Rating Matrix:

```{r}
ratings <- as(ratings, 'realRatingMatrix')

ratings
```


# Split Ratings Data into Training & Test sets

Let's now establish a RecommenderLab Evaluation Scheme, which involves splitting the **`ratings`** into a Training set and a Test set:

```{r}
train_proportion <- .5
nb_of_given_ratings_per_test_user <- 3

evaluation_scheme <- evaluationScheme(
ratings,
method='split',
train=train_proportion,
k=1,
given=nb_of_given_ratings_per_test_user)

evaluation_scheme
```

The data sets split out are as follows:

- Training data:

```{r}
ratings_train <- getData(evaluation_scheme, 'train')

ratings_train
```

- Test data: "known"/"given" ratings:

```{r}
ratings_test_known <- getData(evaluation_scheme, 'known')

ratings_test_known
```

- Test data: "unknown" ratings to be predicted and evaluated against:

```{r}
ratings_test_unknown <- getData(evaluation_scheme, 'unknown')

ratings_test_unknown
```


# Recommendation Models

Let's now train a number of recommendation models. The methods available in the **`recommenderlab`** package are:

```{r}
recommenderRegistry$get_entry_names()
```

The descriptions and default parameters for the methods applicable to a Real Rating Matrix are as follows:

```{r}
recommenderRegistry$get_entries(dataType='realRatingMatrix')
```


## Popularity-Based Recommender

The description and default parameters of this method in **`recommenderlab`** are as follows:

```{r}
recommenderRegistry$get_entry('POPULAR', dataType='realRatingMatrix')
```

We train a popularity-based recommender as follows:

```{r}
popular_rec <- Recommender(
data=ratings_train,
method='POPULAR')

popular_rec
```


## User-Based Collaborative-Filtering Recommender

User-Based Collaborative Filtering ("**UBCF**") assumes that users with similar preferences will rate items similarly. Missing ratings for a user can therefore be predicted by first finding a _**neighborhood**_ of similar users and then aggregating those users' ratings into a prediction.
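
To make this concrete, here is a minimal hand-rolled sketch on a small, purely hypothetical rating matrix (an illustration of the idea only, not how **`recommenderlab`** implements UBCF internally): we predict `user1`'s missing rating of `item5` from the two most similar users.

```{r}
# toy sketch of the UBCF idea on hypothetical data
toy <- matrix(c(5, 3, 4, 4, NA,
                3, 1, 2, 3,  3,
                4, 3, 4, 3,  5,
                3, 3, 1, 5,  4,
                1, 5, 5, 2,  1),
              nrow=5, byrow=TRUE,
              dimnames=list(paste0('user', 1:5), paste0('item', 1:5)))

# Pearson similarity between user1 and every other user, over co-rated items
sims <- apply(toy[-1, ], 1, function(u)
  cor(toy['user1', ], u, use='pairwise.complete.obs'))

# neighborhood: the 2 most similar users
nn <- names(sort(sims, decreasing=TRUE))[1:2]

# center each user's ratings by that user's own mean (the same 'center'
# normalization used below), then take a similarity-weighted average of the
# neighbors' centered ratings of item5
user_means <- rowMeans(toy, na.rm=TRUE)
pred <- user_means['user1'] +
  sum(sims[nn] * (toy[nn, 'item5'] - user_means[nn])) / sum(abs(sims[nn]))
pred   # roughly 4.9 for this toy matrix
```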

The description and default parameters of this method in **`recommenderlab`** are as follows:

```{r}
recommenderRegistry$get_entry('UBCF', dataType='realRatingMatrix')
```

We train a UBCF recommender as follows:

```{r}
# User-Based Collaborative Filtering Recommender
user_based_cofi_rec <- Recommender(
  data=ratings_train,
  method='UBCF',            # User-Based Collaborative Filtering
  parameter=list(
    normalize='center',     # normalize by subtracting each user's average rating;
                            # we don't scale by standard deviations here:
                            # we assume people rate on the same scale
                            # but have different biases
    method='Pearson',       # use Pearson correlation as the similarity measure
    nn=30                   # number of most similar users (nearest neighbors)
                            # whose ratings are aggregated into each prediction
  ))

user_based_cofi_rec
```


## Item-Based Collaborative-Filtering Recommender

Item-Based Collaborative Filtering ("**IBCF**") is a model-based approach which produces recommendations based on the relationship between items inferred from the rating matrix. The assumption behind this approach is that users will prefer items that are similar to other items they like.

The model-building step consists of calculating a similarity matrix containing all item-to-item
similarities using a given similarity measure. Popular measures are Pearson correlation and
Cosine similarity. For each item, only a list of the $k$ most similar items and their similarity values is stored. The set of the $k$ items most similar to item $i$ is denoted $S(i)$ and can be seen as the neighborhood of size $k$ of item $i$. Retaining only $k$ similarities per item improves the space and time complexity significantly, but potentially sacrifices some recommendation quality.
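
Again purely for intuition, here is a toy sketch of this model-building step on the same kind of hypothetical matrix (not the package's internal code): an item-to-item similarity matrix in which only the $k$ largest entries per item are retained.

```{r}
# toy sketch of the IBCF model-building step on hypothetical data
toy <- matrix(c(5, 3, 4, 4, NA,
                3, 1, 2, 3,  3,
                4, 3, 4, 3,  5,
                3, 3, 1, 5,  4,
                1, 5, 5, 2,  1),
              nrow=5, byrow=TRUE,
              dimnames=list(paste0('user', 1:5), paste0('item', 1:5)))

# item-to-item Pearson similarities (items are the columns), over co-rating users
item_sims <- cor(toy, use='pairwise.complete.obs')
diag(item_sims) <- NA        # an item is not part of its own neighborhood

# S(i): for each item, keep only the k most similar items and drop the rest
k <- 2
S <- apply(item_sims, 2, function(s) {
  s[which(rank(-s, na.last='keep') > k)] <- NA
  s
})
S
```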

The description and default parameters of this method in **`recommenderlab`** are as follows:

```{r}
recommenderRegistry$get_entry('IBCF', dataType='realRatingMatrix')
```

We train an IBCF recommender as follows:

```{r}
# Item-Based Collaborative Filtering Recommender
item_based_cofi_rec <- Recommender(
  data=ratings_train,
  method='IBCF',            # Item-Based Collaborative Filtering
  parameter=list(
    normalize='center',     # normalize by subtracting each user's average rating;
                            # we don't scale by standard deviations here:
                            # we assume people rate on the same scale
                            # but have different biases
    method='Pearson',       # use Pearson correlation as the similarity measure
    k=100                   # number of most similar items retained per item,
                            # i.e. the size of each item's neighborhood S(i)
  ))

item_based_cofi_rec
```


## Latent-Factor Collaborative-Filtering Recommender

_**Note**: as of the date of writing, the **`recommenderlab`** package's default implementation of the Latent-Factor Collaborative Filtering / Singular Value Decomposition (SVD) method has a bug and produces unreliable results. The code in this section is commented out for now._

This approach uses Singular-Value Decomposition (SVD) to factor the rating matrix into a product of user-feature and item-feature matrices.
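
For intuition, here is a toy sketch of the factorization idea using base R's `svd()` on a small, complete, hypothetical rating matrix. (Real latent-factor recommenders must handle missing entries, typically via regularized optimization, so this only illustrates the low-rank reconstruction.)

```{r}
# toy sketch of latent-factor reconstruction via SVD on hypothetical, complete data
toy_complete <- matrix(c(5, 3, 4, 4, 4,
                         3, 1, 2, 3, 3,
                         4, 3, 4, 3, 5,
                         3, 3, 1, 5, 4,
                         1, 5, 5, 2, 1),
                       nrow=5, byrow=TRUE,
                       dimnames=list(paste0('user', 1:5), paste0('item', 1:5)))

# center by each user's mean rating (the same 'center' normalization as above)
user_means <- rowMeans(toy_complete)
centered <- toy_complete - user_means

decomp <- svd(centered)
f <- 2                                                    # number of latent factors kept
user_factors <- decomp$u[, 1:f] %*% diag(decomp$d[1:f])   # user-feature matrix
item_factors <- decomp$v[, 1:f]                           # item-feature matrix

# the rank-f reconstruction approximates the centered ratings;
# adding the user means back gives predicted ratings
predicted <- user_factors %*% t(item_factors) + user_means
dimnames(predicted) <- dimnames(toy_complete)
round(predicted, 2)
```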

The description and default parameters of this method in **`recommenderlab`** are as follows:

```{r}
recommenderRegistry$get_entry('SVD', dataType='realRatingMatrix')
```

We train a Latent-Factor CF recommender as follows:

```{r}
# Latent-Factor Collaborative Filtering Recommender
# with matrix factorization by Singular-Value Decomposition (SVD)
# latent_factor_cofi_rec <- Recommender(
#   data=ratings_train,
#   method='SVD',           # Latent-Factor CF via Singular-Value Decomposition
#   parameter=list(
#     categories=30,        # number of latent factors
#     normalize='center',   # normalize by subtracting each user's average rating;
#                           # we don't scale by standard deviations here:
#                           # we assume people rate on the same scale
#                           # but have different biases
#     method='Pearson'      # use Pearson correlation
#   ))

# latent_factor_cofi_rec
```


# Model Evaluation

Now, we make predictions on the Test set and evaluate these recommenders' out-of-sample (OOS) performances:

```{r}
popular_rec_pred <- predict(
popular_rec,
ratings_test_known,
type='ratings')

popular_rec_pred_acc <- calcPredictionAccuracy(
popular_rec_pred,
ratings_test_unknown)

popular_rec_pred_acc
```

```{r}
user_based_cofi_rec_pred <- predict(
user_based_cofi_rec,
ratings_test_known,
type='ratings')

user_based_cofi_rec_pred_acc <- calcPredictionAccuracy(
user_based_cofi_rec_pred,
ratings_test_unknown)

user_based_cofi_rec_pred_acc
```

```{r}
item_based_cofi_rec_pred <- predict(
item_based_cofi_rec,
ratings_test_known,
type='ratings')

item_based_cofi_rec_pred_acc <- calcPredictionAccuracy(
item_based_cofi_rec_pred,
ratings_test_unknown)

item_based_cofi_rec_pred_acc
```

```{r}
# latent_factor_cofi_rec_pred <- predict(
#   latent_factor_cofi_rec,
#   ratings_test_known,
#   type='ratings')

# latent_factor_cofi_rec_pred_acc <- calcPredictionAccuracy(
#   latent_factor_cofi_rec_pred,
#   ratings_test_unknown)

# latent_factor_cofi_rec_pred_acc
```

We can see that the User- and Item-based models perform much better than the Popularity-based model in terms of accuracy.
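
For a side-by-side view, the three error-metric vectors returned by `calcPredictionAccuracy` above (RMSE, MSE & MAE) can be stacked into a single table:

```{r}
# stack the three models' error metrics into one comparison table
rbind(
  POPULAR=popular_rec_pred_acc,
  UBCF=user_based_cofi_rec_pred_acc,
  IBCF=item_based_cofi_rec_pred_acc)
```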