pull #2

Open · wants to merge 2 commits into base: master
53 changes: 53 additions & 0 deletions Book Recommendation/Python/ParseData.py
@@ -0,0 +1,53 @@
from pandas import read_csv
from os.path import join

from ChicagoBoothML_Helpy.Print import printflush


def parse_book_crossing_data(
        data_path='https://raw.githubusercontent.com/ChicagoBoothML/DATA___BookCrossing/master'):
    """Download and parse the Book-Crossing books, users & ratings CSV files."""

    # common strings denoting missing (NA) values:
    na_strings = [
        '',
        'na', 'n.a', 'n.a.',
        'nan', 'n.a.n', 'n.a.n.',
        'NA', 'N.A', 'N.A.',
        'NaN', 'N.a.N', 'N.a.N.',
        'NAN', 'N.A.N', 'N.A.N.',
        'nil', 'Nil', 'NIL',
        'null', 'Null', 'NULL']

    printflush('Parsing Books...', end=' ')
    books = read_csv(
        join(data_path, 'BX-Books.csv'),
        sep=';',
        dtype=str,
        na_values=na_strings,
        usecols=['ISBN', 'Book-Title', 'Book-Author'],
        error_bad_lines=False)   # skip malformed lines (with a warning) instead of raising
    printflush('done!')

    printflush('Parsing Users...', end=' ')
    users = read_csv(
        join(data_path, 'BX-Users.csv'),
        sep=';',
        dtype=str,
        na_values=na_strings,
        error_bad_lines=False)
    users['User-ID'] = users['User-ID'].astype(int)
    users['Age'] = users['Age'].astype(float)
    printflush('done!')

    printflush('Parsing Ratings...', end=' ')
    ratings = read_csv(
        join(data_path, 'BX-Book-Ratings.csv'),
        sep=';',
        dtype=str,
        na_values=na_strings,
        error_bad_lines=False)
    ratings['User-ID'] = ratings['User-ID'].astype(int)
    ratings['Book-Rating'] = ratings['Book-Rating'].astype(float)
    printflush('done!')

    return dict(books=books, users=users, ratings=ratings)
305 changes: 305 additions & 0 deletions Book Recommendation/R/BookRecommendation.Rmd
@@ -0,0 +1,305 @@
---
title: "Recommendation Engine Example: the Book Crossing Data Set"
author: 'Chicago Booth ML Team'
output: pdf_document
fontsize: 12pt
geometry: margin=0.6in
---


_**Note:** the current script exhausts memory even on a well-equipped laptop. With more memory, the code below would run as written. We'll update this once we've found a workaround._


This script uses the [**Book Crossing**](http://www2.informatik.uni-freiburg.de/~cziegler/BX) data set to illustrate personalized recommendation algorithms.

_**Note**: as of the date of writing, the **`recommenderlab`** package's default implementation of the Latent-Factor Collaborative Filtering / Singular Value Decomposition (SVD) method has a bug and produces unreliable results. We will not cover this method for now._


# Load Libraries & Helper Modules

```{r message=FALSE, warning=FALSE}
# load RecommenderLab package
library(recommenderlab)

# import data parser
source('https://raw.githubusercontent.com/ChicagoBoothML/MachineLearning_Fall2015/master/Programming%20Scripts/Book%20Recommendation/R/ParseData.R')
```


# Data Import & Preprocessing

```{r}
data <- parse_book_crossing_data()
books <- data$books
users <- data$users
ratings <- data$ratings
ratings[ , `:=`(user_id = factor(user_id),
                isbn = factor(isbn))]
```

Let's examine the number of ratings per user and per book:

```{r}
nb_ratings_per_user <-
dcast(ratings, user_id ~ ., fun.aggregate=length, value.var='book_rating')

nb_ratings_per_book <-
dcast(ratings, isbn ~ ., fun.aggregate=length, value.var='book_rating')
```

Each user has rated between `r formatC(min(nb_ratings_per_user$.), big.mark=',')` and `r formatC(max(nb_ratings_per_user$.), big.mark=',')` books, and each book has been rated by between `r formatC(min(nb_ratings_per_book$.), big.mark=',')` and `r formatC(max(nb_ratings_per_book$.), big.mark=',')` users.
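
As a quick optional check, a `summary()` of these counts shows how skewed the two distributions are:

```{r}
# distribution of the number of ratings per user and per book
summary(nb_ratings_per_user$.)
summary(nb_ratings_per_book$.)
```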

```{r}
min_nb_ratings_per_user <- 6
```

Let's remove the users with fewer than `r min_nb_ratings_per_user` ratings to make the recommendations more reliable:

```{r}
users_with_enough_nb_ratings <- nb_ratings_per_user[. >= min_nb_ratings_per_user, user_id]

ratings <- ratings[user_id %in% users_with_enough_nb_ratings, ]
```

Let's now convert the **`ratings`** to a RecommenderLab-format Real-Valued Rating Matrix:

```{r}
ratings <- as(ratings, 'realRatingMatrix')

ratings
```


# Split Ratings Data into Training & Test sets

Let's now establish a RecommenderLab Evaluation Scheme, which involves splitting the **`ratings`** into a Training set and a Test set:

```{r}
train_proportion <- .5
nb_of_given_ratings_per_test_user <- 3

evaluation_scheme <- evaluationScheme(
ratings,
method='split',
train=train_proportion,
k=1,
given=nb_of_given_ratings_per_test_user)

evaluation_scheme
```

The data sets split out are as follows:

- Training data:

```{r}
ratings_train <- getData(evaluation_scheme, 'train')

ratings_train
```

- Test data: "known"/"given" ratings:

```{r}
ratings_test_known <- getData(evaluation_scheme, 'known')

ratings_test_known
```

- Test data: "unknown" ratings to be predicted and evaluated against:

```{r}
ratings_test_unknown <- getData(evaluation_scheme, 'unknown')

ratings_test_unknown
```


# Recommendation Models

Let's now train a number of recommendation models. The methods available in the **`recommenderlab`** package are:

```{r}
recommenderRegistry$get_entry_names()
```

The descriptions and default parameters for the methods applicable to a Real Rating Matrix are as follows:

```{r}
recommenderRegistry$get_entries(dataType='realRatingMatrix')
```


## Popularity-Based Recommender

The description and default parameters of this method in **`recommenderlab`** are as follows:

```{r}
recommenderRegistry$get_entry('POPULAR', dataType='realRatingMatrix')
```

We train a popularity-based recommender as follows:

```{r}
popular_rec <- Recommender(
data=ratings_train,
method='POPULAR')

popular_rec
```


## User-Based Collaborative-Filtering Recommender

User-Based Collaborative Filtering ("**UBCF**") assumes that users with similar preferences will rate items similarly. Missing ratings for a user can therefore be predicted by first finding a _**neighborhood**_ of similar users and then aggregating those users' ratings into a prediction.
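
To make this concrete, here is a minimal hand-rolled sketch on a small, purely hypothetical rating matrix (an illustration of the idea only, not how **`recommenderlab`** implements UBCF internally): we predict `user1`'s missing rating of `item5` from the two most similar users.

```{r}
# toy sketch of the UBCF idea on hypothetical data
toy <- matrix(c(5, 3, 4, 4, NA,
                3, 1, 2, 3,  3,
                4, 3, 4, 3,  5,
                3, 3, 1, 5,  4,
                1, 5, 5, 2,  1),
              nrow=5, byrow=TRUE,
              dimnames=list(paste0('user', 1:5), paste0('item', 1:5)))

# Pearson similarity between user1 and every other user, over co-rated items
sims <- apply(toy[-1, ], 1, function(u)
  cor(toy['user1', ], u, use='pairwise.complete.obs'))

# neighborhood: the 2 most similar users
nn <- names(sort(sims, decreasing=TRUE))[1:2]

# center each user's ratings by that user's own mean (the same 'center'
# normalization used below), then take a similarity-weighted average of the
# neighbors' centered ratings of item5
user_means <- rowMeans(toy, na.rm=TRUE)
pred <- user_means['user1'] +
  sum(sims[nn] * (toy[nn, 'item5'] - user_means[nn])) / sum(abs(sims[nn]))
pred   # roughly 4.9 for this toy matrix
```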

The description and default parameters of this method in **`recommenderlab`** are as follows:

```{r}
recommenderRegistry$get_entry('UBCF', dataType='realRatingMatrix')
```

We train a UBCF recommender as follows:

```{r}
# User-Based Collaborative Filtering Recommender
user_based_cofi_rec <- Recommender(
  data=ratings_train,
  method='UBCF',            # User-Based Collaborative Filtering
  parameter=list(
    normalize='center',     # normalize by subtracting each user's average rating;
                            # we don't scale by standard deviations here:
                            # we assume people rate on the same scale
                            # but have different biases
    method='Pearson',       # use Pearson correlation as the similarity measure
    nn=30                   # number of most similar users (nearest neighbors)
                            # whose ratings are aggregated into each prediction
  ))

user_based_cofi_rec
```


## Item-Based Collaborative-Filtering Recommender

Item-Based Collaborative Filtering ("**IBCF**") is a model-based approach which produces recommendations based on the relationship between items inferred from the rating matrix. The assumption behind this approach is that users will prefer items that are similar to other items they like.

The model-building step consists of calculating a similarity matrix containing all item-to-item
similarities using a given similarity measure. Popular measures are Pearson correlation and
Cosine similarity. For each item, only a list of the $k$ most similar items and their similarity values is stored. The set of the $k$ items most similar to item $i$ is denoted $S(i)$ and can be seen as the neighborhood of size $k$ of item $i$. Retaining only $k$ similarities per item improves the space and time complexity significantly, but potentially sacrifices some recommendation quality.
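
Again purely for intuition, here is a toy sketch of this model-building step on the same kind of hypothetical matrix (not the package's internal code): an item-to-item similarity matrix in which only the $k$ largest entries per item are retained.

```{r}
# toy sketch of the IBCF model-building step on hypothetical data
toy <- matrix(c(5, 3, 4, 4, NA,
                3, 1, 2, 3,  3,
                4, 3, 4, 3,  5,
                3, 3, 1, 5,  4,
                1, 5, 5, 2,  1),
              nrow=5, byrow=TRUE,
              dimnames=list(paste0('user', 1:5), paste0('item', 1:5)))

# item-to-item Pearson similarities (items are the columns), over co-rating users
item_sims <- cor(toy, use='pairwise.complete.obs')
diag(item_sims) <- NA        # an item is not part of its own neighborhood

# S(i): for each item, keep only the k most similar items and drop the rest
k <- 2
S <- apply(item_sims, 2, function(s) {
  s[which(rank(-s, na.last='keep') > k)] <- NA
  s
})
S
```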

The description and default parameters of this method in **`recommenderlab`** are as follows:

```{r}
recommenderRegistry$get_entry('IBCF', dataType='realRatingMatrix')
```

We train an IBCF recommender as follows:

```{r}
# Item-Based Collaborative Filtering Recommender
item_based_cofi_rec <- Recommender(
  data=ratings_train,
  method='IBCF',            # Item-Based Collaborative Filtering
  parameter=list(
    normalize='center',     # normalize by subtracting each user's average rating;
                            # we don't scale by standard deviations here:
                            # we assume people rate on the same scale
                            # but have different biases
    method='Pearson',       # use Pearson correlation as the similarity measure
    k=100                   # number of most similar items retained per item,
                            # i.e. the size of each item's neighborhood S(i)
  ))

item_based_cofi_rec
```


## Latent-Factor Collaborative-Filtering Recommender

_**Note**: as of the date of writing, the **`recommenderlab`** package's default implementation of the Latent-Factor Collaborative Filtering / Singular Value Decomposition (SVD) method has a bug and produces unreliable results. The code in this section is commented out for now._

This approach uses Singular-Value Decomposition (SVD) to factor the rating matrix into a product of user-feature and item-feature matrices.
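
For intuition, here is a toy sketch of the factorization idea using base R's `svd()` on a small, complete, hypothetical rating matrix. (Real latent-factor recommenders must handle missing entries, typically via regularized optimization, so this only illustrates the low-rank reconstruction.)

```{r}
# toy sketch of latent-factor reconstruction via SVD on hypothetical, complete data
toy_complete <- matrix(c(5, 3, 4, 4, 4,
                         3, 1, 2, 3, 3,
                         4, 3, 4, 3, 5,
                         3, 3, 1, 5, 4,
                         1, 5, 5, 2, 1),
                       nrow=5, byrow=TRUE,
                       dimnames=list(paste0('user', 1:5), paste0('item', 1:5)))

# center by each user's mean rating (the same 'center' normalization as above)
user_means <- rowMeans(toy_complete)
centered <- toy_complete - user_means

decomp <- svd(centered)
f <- 2                                                    # number of latent factors kept
user_factors <- decomp$u[, 1:f] %*% diag(decomp$d[1:f])   # user-feature matrix
item_factors <- decomp$v[, 1:f]                           # item-feature matrix

# the rank-f reconstruction approximates the centered ratings;
# adding the user means back gives predicted ratings
predicted <- user_factors %*% t(item_factors) + user_means
dimnames(predicted) <- dimnames(toy_complete)
round(predicted, 2)
```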

The description and default parameters of this method in **`recommenderlab`** are as follows:

```{r}
recommenderRegistry$get_entry('SVD', dataType='realRatingMatrix')
```

We train a Latent-Factor CF recommender as follows:

```{r}
# Latent-Factor Collaborative Filtering Recommender
# with matrix factorization by Singular-Value Decomposition (SVD)
# latent_factor_cofi_rec <- Recommender(
#   data=ratings_train,
#   method='SVD',           # Latent-Factor CF via Singular-Value Decomposition
#   parameter=list(
#     categories=30,        # number of latent factors
#     normalize='center',   # normalize by subtracting each user's average rating;
#                           # we don't scale by standard deviations here:
#                           # we assume people rate on the same scale
#                           # but have different biases
#     method='Pearson'      # use Pearson correlation
#   ))

# latent_factor_cofi_rec
```


# Model Evaluation

Now, we make predictions on the Test set and evaluate these recommenders' out-of-sample (OOS) performances:

```{r}
popular_rec_pred <- predict(
popular_rec,
ratings_test_known,
type='ratings')

popular_rec_pred_acc <- calcPredictionAccuracy(
popular_rec_pred,
ratings_test_unknown)

popular_rec_pred_acc
```

```{r}
user_based_cofi_rec_pred <- predict(
user_based_cofi_rec,
ratings_test_known,
type='ratings')

user_based_cofi_rec_pred_acc <- calcPredictionAccuracy(
user_based_cofi_rec_pred,
ratings_test_unknown)

user_based_cofi_rec_pred_acc
```

```{r}
item_based_cofi_rec_pred <- predict(
item_based_cofi_rec,
ratings_test_known,
type='ratings')

item_based_cofi_rec_pred_acc <- calcPredictionAccuracy(
item_based_cofi_rec_pred,
ratings_test_unknown)

item_based_cofi_rec_pred_acc
```

```{r}
# latent_factor_cofi_rec_pred <- predict(
#   latent_factor_cofi_rec,
#   ratings_test_known,
#   type='ratings')

# latent_factor_cofi_rec_pred_acc <- calcPredictionAccuracy(
#   latent_factor_cofi_rec_pred,
#   ratings_test_unknown)

# latent_factor_cofi_rec_pred_acc
```

We can see that the User- and Item-based models perform much better than the Popularity-based model in terms of accuracy.
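
For a side-by-side view, the three error-metric vectors returned by `calcPredictionAccuracy` above (RMSE, MSE & MAE) can be stacked into a single table:

```{r}
# stack the three models' error metrics into one comparison table
rbind(
  POPULAR=popular_rec_pred_acc,
  UBCF=user_based_cofi_rec_pred_acc,
  IBCF=item_based_cofi_rec_pred_acc)
```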