Solution:
Note: sometimes your answer doesn't match one of the options exactly. That's fine. Select the option that's closest to your solution.
In this homework, we will use the Credit Card Data from the book "Econometric Analysis".
Here's a wget-able link:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AER_credit_card_data.csv
```
The goal of this homework is to inspect the output of different evaluation metrics by creating a classification model (target column `card`).
- Create the target variable by mapping `yes` to 1 and `no` to 0.
- Split the dataset into 3 parts: train/validation/test with 60%/20%/20% distribution. Use the `train_test_split` function for that with `random_state=1`.
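A minimal sketch of these two steps, assuming the CSV has been downloaded as shown above (the variable names `df_full_train`, `df_train`, `df_val`, `df_test` are illustrative choices, not prescribed by the homework):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('AER_credit_card_data.csv')

# Map the target: 'yes' -> 1, 'no' -> 0
df['card'] = (df['card'] == 'yes').astype(int)

# 60%/20%/20%: first hold out 20% for test, then split the
# remaining 80% into train/validation (0.25 * 0.8 = 0.2)
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
```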
ROC AUC could also be used to evaluate the feature importance of numerical variables. Let's do that:

- For each numerical variable, use it as the score and compute the AUC with the `card` variable.
- Use the training dataset for that.
If your AUC is < 0.5, invert this variable by putting "-" in front (e.g. `-df_train['expenditure']`).

AUC can go below 0.5 if the variable is negatively correlated with the target variable. You can change the direction of the correlation by negating this variable; the negative correlation then becomes positive.
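A sketch of this check for the four candidate columns from the question below, reusing `df_train` from the split above:

```python
from sklearn.metrics import roc_auc_score

for col in ['reports', 'dependents', 'active', 'share']:
    auc = roc_auc_score(df_train['card'], df_train[col])
    # A score below 0.5 means the variable is negatively correlated
    # with the target: negate it to flip the direction
    if auc < 0.5:
        auc = roc_auc_score(df_train['card'], -df_train[col])
    print(f'{col}: {auc:.3f}')
```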
Which numerical variable (among the following 4) has the highest AUC?
- `reports`
- `dependents`
- `active`
- `share`
From now on, use these columns only:

```python
["reports", "age", "income", "share", "expenditure", "dependents", "months", "majorcards", "active", "owner", "selfemp"]
```
Apply one-hot-encoding using `DictVectorizer` and train the logistic regression with these parameters:

```python
LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
```
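One possible way to wire this together, assuming the splits from the sketch above; `DictVectorizer` passes numerical columns through as-is and one-hot-encodes the string columns:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

columns = ["reports", "age", "income", "share", "expenditure", "dependents",
           "months", "majorcards", "active", "owner", "selfemp"]

# One-hot-encoding via DictVectorizer
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train[columns].to_dict(orient='records'))
X_val = dv.transform(df_val[columns].to_dict(orient='records'))

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
model.fit(X_train, df_train['card'])

# Predicted probabilities for the positive class
y_pred = model.predict_proba(X_val)[:, 1]
print(round(roc_auc_score(df_val['card'], y_pred), 3))
```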
What's the AUC of this model on the validation dataset? (round to 3 digits)
- 0.615
- 0.515
- 0.715
- 0.995
Now let's compute precision and recall for our model.
- Evaluate the model on the validation dataset at all thresholds from 0.0 to 1.0 with step 0.01
- For each threshold, compute precision and recall
- Plot them
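A sketch of the threshold sweep, continuing from `y_pred` and `df_val` above (counting TP/FP/FN explicitly is one way to do it; `sklearn.metrics.precision_recall_curve` is another):

```python
import numpy as np
import matplotlib.pyplot as plt

thresholds = np.arange(0.0, 1.01, 0.01)
precisions, recalls = [], []

actual = (df_val['card'] == 1).values

for t in thresholds:
    predicted = (y_pred >= t)

    tp = (predicted & actual).sum()
    fp = (predicted & ~actual).sum()
    fn = (~predicted & actual).sum()

    # Guard against division by zero at extreme thresholds
    precisions.append(tp / (tp + fp) if (tp + fp) > 0 else 1.0)
    recalls.append(tp / (tp + fn) if (tp + fn) > 0 else 0.0)

plt.plot(thresholds, precisions, label='precision')
plt.plot(thresholds, recalls, label='recall')
plt.xlabel('threshold')
plt.legend()
plt.show()
```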
At which threshold do the precision and recall curves intersect?
- 0.1
- 0.3
- 0.6
- 0.8
Precision and recall are conflicting - when one grows, the other goes down. That's why they are often combined into the F1 score - a metric that takes both into account.
This is the formula for computing it:

$$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$$

where $P$ is precision and $R$ is recall.
Let's compute F1 for all thresholds from 0.0 to 1.0 with increment 0.01 using the validation set.
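Reusing the `precisions` and `recalls` lists from the previous sketch:

```python
f1_scores = [2 * p * r / (p + r) if (p + r) > 0 else 0.0
             for p, r in zip(precisions, recalls)]

# Threshold at which F1 peaks
print(thresholds[int(np.argmax(f1_scores))])
```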
At which threshold is F1 maximal?
- 0.1
- 0.4
- 0.6
- 0.7
Use the `KFold` class from Scikit-Learn to evaluate our model on 5 different folds:

```python
KFold(n_splits=5, shuffle=True, random_state=1)
```
- Iterate over different folds of `df_full_train`
- Split the data into train and validation
- Train the model on train with these parameters: `LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)`
- Use AUC to evaluate the model on validation
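A sketch of this cross-validation loop, reusing `df_full_train`, `columns`, and the imports from the sketches above:

```python
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = []

for train_idx, val_idx in kfold.split(df_full_train):
    df_tr = df_full_train.iloc[train_idx]
    df_va = df_full_train.iloc[val_idx]

    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(df_tr[columns].to_dict(orient='records'))
    X_va = dv.transform(df_va[columns].to_dict(orient='records'))

    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000)
    model.fit(X_tr, df_tr['card'])

    y_pred = model.predict_proba(X_va)[:, 1]
    scores.append(roc_auc_score(df_va['card'], y_pred))

print(f'mean={np.mean(scores):.3f} std={np.std(scores):.3f}')
```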
How large is the standard deviation of the AUC scores across the different folds?
- 0.003
- 0.014
- 0.09
- 0.24
Now let's use 5-fold cross-validation to find the best parameter `C`.
- Iterate over the following `C` values: `[0.01, 0.1, 1, 10]`
- Initialize `KFold` with the same parameters as previously
- Use these parameters for the model: `LogisticRegression(solver='liblinear', C=C, max_iter=1000)`
- Compute the mean score as well as the std (round the mean and std to 3 decimal digits)
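The same loop as in the previous sketch, wrapped in an outer iteration over `C`:

```python
for C in [0.01, 0.1, 1, 10]:
    kfold = KFold(n_splits=5, shuffle=True, random_state=1)
    scores = []

    for train_idx, val_idx in kfold.split(df_full_train):
        df_tr = df_full_train.iloc[train_idx]
        df_va = df_full_train.iloc[val_idx]

        dv = DictVectorizer(sparse=False)
        X_tr = dv.fit_transform(df_tr[columns].to_dict(orient='records'))
        X_va = dv.transform(df_va[columns].to_dict(orient='records'))

        model = LogisticRegression(solver='liblinear', C=C, max_iter=1000)
        model.fit(X_tr, df_tr['card'])

        y_pred = model.predict_proba(X_va)[:, 1]
        scores.append(roc_auc_score(df_va['card'], y_pred))

    print(f'C={C}: mean={np.mean(scores):.3f} std={np.std(scores):.3f}')
```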
Which C leads to the best mean score?
- 0.01
- 0.1
- 1
- 10
If you have ties, select the score with the lowest std. If you still have ties, select the smallest `C`.
- Submit your results here: https://forms.gle/8TfKNRd5Jq7sGK5M9
- You can submit your solution multiple times. In this case, only the last submission will be used
- If your answer doesn't match options exactly, select the closest one
The deadline for submitting is October 3 (Monday), 23:00 CEST.
After that, the form will be closed.