96 changes: 83 additions & 13 deletions 02_assignments/assignment_2.ipynb
@@ -59,7 +59,8 @@
"# Load the \"Caravan\" dataset using the \"load_data\" function from the ISLP package\n",
"Caravan = load_data('Caravan')\n",
"\n",
"# Add your code here"
"# Add your code here\n",
"Caravan.describe()\n"
]
},
{
@@ -81,7 +82,19 @@
"metadata": {},
"outputs": [],
"source": [
"# Add your code here"
"# Add your code here\n",
"print(Caravan)"
]
},
{
"cell_type": "markdown",
"id": "3c6b411d",
"metadata": {},
"source": [
"(i) 5822\n",
"(ii) 86\n",
"(iii) Categorical variable with 2 levels, Yes or No.\n",
"(iv) 85"
]
},
{
@@ -120,13 +133,13 @@
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "40a8c0f5",
"cell_type": "markdown",
"id": "f7af73d6",
"metadata": {},
"outputs": [],
"source": [
"# Your answer here"
"(v) Standardization puts variables measured on different scales onto a common scale, so that their effects on the response variable can be compared on equal footing and no variable dominates simply because of its units. This matters especially for distance-based methods such as KNN. Without standardization, the predictor variables would not contribute equally to the analysis.\n",
"\n",
"(vi) Standardization applies to numerical variables measured on a continuous scale. The response variable in this case is categorical, so it does not need to be standardized."
]
},
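A minimal sketch of the standardization described in (v), using made-up numbers rather than the Caravan data. The helper `standardize` and the columns `income` and `age` are hypothetical and only illustrate the idea; the notebook's own `predictors_standardized` is built elsewhere and is not reproduced here.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical predictors measured on very different scales.
income = [20_000.0, 35_000.0, 50_000.0, 95_000.0]
age = [25.0, 32.0, 41.0, 58.0]

income_z = standardize(income)
age_z = standardize(age)

# Before scaling, any distance between rows is dominated by income.
# After scaling, both columns have mean 0 and standard deviation 1,
# so neither dominates purely because of its units.
print([round(z, 3) for z in income_z])
print([round(z, 3) for z in age_z])
```

The response `Purchase` is left out on purpose: it is a Yes/No label, so there is no mean or standard deviation to rescale by.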
{
@@ -145,7 +158,15 @@
"metadata": {},
"outputs": [],
"source": [
"# Add your code here"
"np.random.seed(1)"
]
},
{
"cell_type": "markdown",
"id": "857f1491",
"metadata": {},
"source": [
"(vii) A random seed ensures that the results are reproducible: the same seed value always generates the same output. This keeps the classification results consistent between runs. When an observation falls between the two classes, the choice of which class to assign it to can be random, and setting a seed ensures the model makes the same choice each time."
]
},
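A small sketch of the reproducibility point in (vii). It uses the standard library's `random` module as a stand-in for NumPy's generator, and the helper `random_split_mask` is hypothetical; it is not the notebook's own split code.

```python
import random


def random_split_mask(n_rows, test_fraction, seed):
    """Return a list of booleans marking which rows go to the test set."""
    rng = random.Random(seed)  # independent generator, seeded explicitly
    return [rng.random() < test_fraction for _ in range(n_rows)]


mask_a = random_split_mask(10, 0.3, seed=1)
mask_b = random_split_mask(10, 0.3, seed=1)
mask_c = random_split_mask(10, 0.3, seed=2)

print(mask_a == mask_b)  # same seed, so the split is identical
print(mask_a == mask_c)  # a different seed usually gives a different split
```

Because the training/testing split is itself random, fixing the seed is what makes the confusion tables and accuracy figures repeatable from one run to the next.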
{
@@ -176,7 +197,7 @@
"testing_X = predictors_standardized[~split]\n",
"\n",
"# Define the testing set for Y (response)\n",
"testing_Y = Caravan.loc[~split, 'Purchase']\n"
"testing_Y = Caravan.loc[~split, 'Purchase']"
]
},
{
@@ -194,7 +215,12 @@
"metadata": {},
"outputs": [],
"source": [
"# Add your code here"
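"# Fit KNN with K=1 on the training set\n",
"knn1 = KNeighborsClassifier(n_neighbors=1)\n",
"training_X, testing_X = [np.asarray(X) for X in [training_X, testing_X]]\n",
"knn1.fit(training_X, training_Y)\n",
"knn1_pred = knn1.predict(testing_X)\n",
"\n",
"# Print the confusion table (only a cell's last expression is displayed automatically)\n",
"print(confusion_table(knn1_pred, testing_Y))\n",
"\n",
"# Overall accuracy on the testing set\n",
"np.mean(knn1_pred == testing_Y)"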
]
},
{
@@ -214,7 +240,7 @@
"metadata": {},
"outputs": [],
"source": [
"# prediction accuracy rate"
"# The prediction accuracy rate is 89%."
]
},
{
@@ -224,7 +250,7 @@
"metadata": {},
"outputs": [],
"source": [
"# prediction error rate"
"# The prediction error rate is 11%."
]
},
{
@@ -248,6 +274,28 @@
"print(percentage_purchase)"
]
},
{
"cell_type": "markdown",
"id": "443eaaea",
"metadata": {},
"source": [
"Based on the accuracy rate, which counts both the correctly predicted ‘No’ and the correctly predicted ‘Yes’ responses, the prediction accuracy is 89%, which looks very high. Yet the confusion table shows that the model is not very accurate at predicting ‘Yes’ for a true ‘Yes’.\n",
"\n",
"Confusion table:\n",
"\n",
"| | Truth: No | Truth: Yes |\n",
"|---|---|---|\n",
"| Predicted: No | 1315 | 68 |\n",
"| Predicted: Yes | 96 | 12 |\n",
"\n",
"The percentage of customers in the Caravan dataset who actually purchased the insurance is 5.977%.\n",
"The percentage of customers in the testing set who actually purchased the insurance is (68+12)/(1315+96+68+12)*100% = 5.4%, which is comparable to the 5.977% in the whole Caravan dataset.\n",
"Yet in the testing set, the rate at which the model predicts ‘Yes’ for a true ‘Yes’ is 12/(68+12)*100% = 15%, which is far below the model’s overall accuracy rate of 89% and is even worse than a random guess (50%).\n",
"\n",
"This is due to the imbalanced class distribution. The model predicts ‘No’ for a true ‘No’ very well, and the ‘No’ class makes up a much larger share of the population, (100-5.977)% = 94%. This raises the overall accuracy rate. However, the confusion table shows that the model does not predict the true ‘Yes’ responses well.\n",
"\n"
]
},
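The arithmetic in the answer above can be checked with a short sketch. The four counts are copied from the K=1 confusion table as listed; the variable names are illustrative, and the "always predict No" baseline is an added comparison rather than something computed in the notebook.

```python
# Counts copied from the K=1 confusion table in the answer above.
true_no_pred_no = 1315
true_no_pred_yes = 96
true_yes_pred_no = 68
true_yes_pred_yes = 12

total = true_no_pred_no + true_no_pred_yes + true_yes_pred_no + true_yes_pred_yes

# Overall accuracy: correct predictions of either class over all test rows.
accuracy = (true_no_pred_no + true_yes_pred_yes) / total

# Rate at which a true 'Yes' is predicted as 'Yes' (recall for 'Yes').
recall_yes = true_yes_pred_yes / (true_yes_pred_no + true_yes_pred_yes)

# Share of predicted 'Yes' that are truly 'Yes' (precision for 'Yes').
precision_yes = true_yes_pred_yes / (true_no_pred_yes + true_yes_pred_yes)

# Accuracy of a trivial rule that predicts 'No' for every customer.
share_yes = (true_yes_pred_no + true_yes_pred_yes) / total
baseline_accuracy = 1 - share_yes

print(f"accuracy          = {accuracy:.3f}")
print(f"recall for Yes    = {recall_yes:.3f}")
print(f"precision for Yes = {precision_yes:.3f}")
print(f"always-No rule    = {baseline_accuracy:.3f}")
```

With classes this imbalanced, the trivial always-No rule scores a higher overall accuracy than the K=1 model, which is why the per-class rates are the more informative numbers here.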
{
"cell_type": "markdown",
"id": "a7e19b5e-ad65-47f8-a3ef-e0f68a75048e",
@@ -263,7 +311,29 @@
"metadata": {},
"outputs": [],
"source": [
"# Your code here"
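"# Fit KNN with K=3 on the training set\n",
"knn2 = KNeighborsClassifier(n_neighbors=3)\n",
"training_X, testing_X = [np.asarray(X) for X in [training_X, testing_X]]\n",
"knn2.fit(training_X, training_Y)\n",
"knn2_pred = knn2.predict(testing_X)\n",
"\n",
"# Print the confusion table for the K=3 predictions\n",
"print(confusion_table(knn2_pred, testing_Y))\n",
"\n",
"# Overall accuracy on the testing set\n",
"np.mean(knn2_pred == testing_Y)"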
]
},
{
"cell_type": "markdown",
"id": "186d013e",
"metadata": {},
"source": [
"The overall accuracy rate for the second KNN model is 92.5%. \n",
"\n",
"Confusion table:\n",
"\n",
"| | Truth: No | Truth: Yes |\n",
"|---|---|---|\n",
"| Predicted: No | 1343 | 87 |\n",
"| Predicted: Yes | 22 | 5 |\n",
"\n",
"Based on the confusion table, the rate at which the second KNN model predicts ‘Yes’ for a true ‘Yes’ is 5/(87+5)*100% = 5.43%, which is worse than that of the first KNN model.\n",
"Though this model has a higher overall accuracy, it predicts the true ‘Yes’ responses no better than the first KNN model or a random guess.\n"
]
},
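One plausible reason a larger K predicts ‘Yes’ less often on imbalanced data is that the majority vote is more easily won by nearby ‘No’ points. The sketch below shows this on a toy dataset with a from-scratch `knn_predict` helper. Both the helper and the data are hypothetical stand-ins, not scikit-learn's `KNeighborsClassifier` or the Caravan data.

```python
from collections import Counter
import math


def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy, imbalanced training set: a single 'Yes' point surrounded by 'No' points.
train_X = [(1.0, 1.0), (1.3, 1.0), (1.0, 1.3), (0.0, 0.0), (0.2, 0.1), (0.1, 0.3)]
train_y = ["Yes", "No", "No", "No", "No", "No"]

query = (1.05, 1.05)  # closest to the lone 'Yes' point

pred_k1 = knn_predict(train_X, train_y, query, k=1)
pred_k3 = knn_predict(train_X, train_y, query, k=3)

print(pred_k1)  # Yes: the single nearest neighbour is the 'Yes' point
print(pred_k3)  # No: two of the three nearest neighbours are 'No'
```

This only illustrates the voting mechanism. Whether it fully explains the drop from 12 to 5 correctly predicted ‘Yes’ responses would need to be checked against the actual predictions.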
{