From 61b9acb646af9afb3c01468730449f7ef5a771e6 Mon Sep 17 00:00:00 2001
From: monzchan <166637673+monzchan@users.noreply.github.com>
Date: Sat, 8 Jun 2024 18:03:38 -0400
Subject: [PATCH 1/3] complete assignment=2

---
 02_assignments/assignment_2.ipynb | 87 ++++++++++++++++++++++++++-----
 1 file changed, 74 insertions(+), 13 deletions(-)
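
Note on answer (v): a minimal sketch, in plain Python with made-up numbers, of why a distance-based method such as KNN is sensitive to the scale of its predictors. The toy customers, the two feature meanings, and the helpers `euclidean`, `standardize`, and `nearest_neighbor` are all hypothetical and are not part of the notebook, which works with `predictors_standardized` instead.

```python
import math
import statistics

# Hypothetical toy data: (annual income, number of policies held).
customers = [
    (30000.0, 1.0),   # query customer
    (30500.0, 10.0),  # similar income, very different number of policies
    (33000.0, 1.0),   # slightly higher income, same number of policies
    (90000.0, 2.0),   # high-income customer that widens the income scale
]


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(rows):
    """Rescale every column to mean 0 and population standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


def nearest_neighbor(rows, query_index):
    """Index of the row closest to rows[query_index], excluding itself."""
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: euclidean(rows[query_index], rows[i]))


raw_nearest = nearest_neighbor(customers, 0)
scaled_nearest = nearest_neighbor(standardize(customers), 0)
print("nearest neighbour on raw data:", raw_nearest)
print("nearest neighbour on standardized data:", scaled_nearest)
```

On the raw values the income column decides the nearest neighbour almost alone. After standardization both columns contribute, and the nearest neighbour of the first customer changes.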

diff --git a/02_assignments/assignment_2.ipynb b/02_assignments/assignment_2.ipynb
index 502e41033..1de70ad64 100644
--- a/02_assignments/assignment_2.ipynb
+++ b/02_assignments/assignment_2.ipynb
@@ -59,7 +59,8 @@
     "# Load the \"Caravan\" dataset using the \"load_data\" function from the ISLP package\n",
     "Caravan = load_data('Caravan')\n",
     "\n",
-    "# Add your code here"
+    "# Add your code here\n",
+    "Caravan.describe()\n"
    ]
   },
   {
@@ -81,7 +82,19 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Add your code here"
+    "# Add your code here\n",
+    "print(Caravan)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3c6b411d",
+   "metadata": {},
+   "source": [
+    "(i) 5822\n",
+    "(ii) 86\n",
+    "(iii) Categorical variable with two levels, Yes and No.\n",
+    "(iv) 85"
    ]
   },
   {
@@ -120,13 +133,13 @@
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "40a8c0f5",
+   "cell_type": "markdown",
+   "id": "f7af73d6",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "# Your answer here"
+    "(v) Standardization puts the predictors on the same scale (mean 0, standard deviation 1). KNN classifies an observation by its distance to other observations, so without standardization the predictors measured on larger scales would dominate the distance and the predictors would not contribute equally to the analysis.\n",
+    "\n",
+    "(vi) Standardization applies to numerical variables measured on a continuous scale, but the response variable in this case is categorical, so it does not need to be standardized."
    ]
   },
   {
@@ -145,7 +158,15 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Add your code here"
+    "np.random.seed(1)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "857f1491",
+   "metadata": {},
+   "source": [
+    "(vii) Setting a random seed ensures the results are reproducible: the same seed value always generates the same output. This matters for keeping the classification consistent between runs. When an observation falls between the two classes, it is assigned to one of them at random, and setting a seed ensures the model makes the same choice each time."
    ]
   },
   {
@@ -176,7 +197,7 @@
     "testing_X = predictors_standardized[~split]\n",
     "\n",
     "# Define the testing set for Y (response)\n",
-    "testing_Y = Caravan.loc[~split, 'Purchase']\n"
+    "testing_Y = Caravan.loc[~split, 'Purchase']"
    ]
   },
   {
@@ -194,7 +215,12 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Add your code here"
+    "knn1 = KNeighborsClassifier(n_neighbors=1)\n",
+    "training_X, testing_X = [np.asarray(X) for X in [training_X, testing_X]]\n",
+    "knn1.fit(training_X, training_Y)\n",
+    "knn1_pred = knn1.predict(testing_X)\n",
+    "print(confusion_table(knn1_pred, testing_Y))\n",
+    "np.mean(knn1_pred == testing_Y)"
    ]
   },
   {
@@ -214,7 +240,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# prediction accuracy rate"
+    "np.mean(knn1_pred == testing_Y)  # prediction accuracy rate: about 89%"
    ]
   },
   {
@@ -224,7 +250,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# prediction error rate"
+    "np.mean(knn1_pred != testing_Y)  # prediction error rate: about 11%"
    ]
   },
   {
@@ -248,6 +274,23 @@
     "print(percentage_purchase)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "443eaaea",
+   "metadata": {},
+   "source": [
+    "Based on the accuracy rate calculated by considering the true No, true yes, predicted No, and predicted yes, the prediction accuracy rate is 89Í% which is very high. Yet, by looking at the confusion table, the model predicting a ‘Yes’ for a true Yes may not be very accurate.\n",
+    " \n",
+    " ![image.png](attachment:image.png)\n",
+    "\n",
+    "The percentage of customer in the Caravan dataset who actually purchase the insurance is 5.977%.\n",
+    "The percentage of customer in the testing set who have actually purchased the insurance is (68+12)/(1315+88+96+12)*100% = 4.5% which is comparable to the 5.977% of customer actually purchase in the insurance in the whole Caravan dataset.\n",
+    "Yet in the testing set, the model predicts ‘Yes’ for only 12/(68+12)*100% = 15% of the ‘true Yes’ cases, which is far below the 89% overall accuracy rate of the model and is even worse than a random guess (50%).\n",
+    "\n",
+    "This is due to the imbalanced class distribution. The model predicts ‘No’ for the ‘true No’ very well, and the ‘No’ class makes up a much larger share of the data, (100-5.977)% = 94%. This raises the overall accuracy rate; however, the confusion table and the model’s performance on the ‘Yes’ response make it clear that the model does not predict the ‘true Yes’ response well.\n",
+    "\n"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "a7e19b5e-ad65-47f8-a3ef-e0f68a75048e",
@@ -263,7 +306,25 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Your code here"
+    "knn2 = KNeighborsClassifier(n_neighbors=3)\n",
+    "training_X, testing_X = [np.asarray(X) for X in [training_X, testing_X]]\n",
+    "knn2.fit(training_X, training_Y)\n",
+    "knn2_pred = knn2.predict(testing_X)\n",
+    "print(confusion_table(knn2_pred, testing_Y))\n",
+    "np.mean(knn2_pred == testing_Y)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "186d013e",
+   "metadata": {},
+   "source": [
+    "The overall accuracy rate for the second KNN model is 92.5%. \n",
+    "\n",
+    "![image.png](attachment:image.png) \n",
+    "\n",
+    "Based on the confusion table, the accuracy of the second KNN model to predict the ‘true Yes’ response is 5/(87+5)*100% = 5.43% which is worse than that for the first KNN model.\n",
+    "Through this model has a higher overall accuracy, it does not perform better than the first KNN model to predict the ‘true Yes’ response nor a random guess.\n"
    ]
   },
   {

From 9b7acaa9459a85f1c08e47d91cc15a4900d8628c Mon Sep 17 00:00:00 2001
From: monzchan <166637673+monzchan@users.noreply.github.com>
Date: Sat, 8 Jun 2024 18:07:46 -0400
Subject: [PATCH 2/3] amend img to text

---
 02_assignments/assignment_2.ipynb | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/02_assignments/assignment_2.ipynb b/02_assignments/assignment_2.ipynb
index 1de70ad64..a3ec21571 100644
--- a/02_assignments/assignment_2.ipynb
+++ b/02_assignments/assignment_2.ipynb
@@ -281,7 +281,11 @@
    "source": [
     "Based on the accuracy rate calculated by considering the true No, true yes, predicted No, and predicted yes, the prediction accuracy rate is 89Í% which is very high. Yet, by looking at the confusion table, the model predicting a ‘Yes’ for a true Yes may not be very accurate.\n",
     " \n",
-    " ![image.png](attachment:image.png)\n",
+    "Confusion table:\n",
+    "Truth          No     Yes\n",
+    "Predicted\n",
+    "No            1315     68\n",
+    "Yes            96       12\n",
     "\n",
     "The percentage of customer in the Caravan dataset who actually purchase the insurance is 5.977%.\n",
     "The percentage of customer in the testing set who have actually purchased the insurance is (68+12)/(1315+88+96+12)*100% = 4.5% which is comparable to the 5.977% of customer actually purchase in the insurance in the whole Caravan dataset.\n",
@@ -321,7 +325,12 @@
    "source": [
     "The overall accuracy rate for the second KNN model is 92.5%. \n",
     "\n",
-    "![image.png](attachment:image.png) \n",
+    "Confusion table:\n",
+    "Truth          No     Yes\n",
+    "Predicted\n",
+    "No            1343     87\n",
+    "Yes            22       5\n",
+    "\n",
     "\n",
     "Based on the confusion table, the accuracy of the second KNN model to predict the ‘true Yes’ response is 5/(87+5)*100% = 5.43% which is worse than that for the first KNN model.\n",
     "Through this model has a higher overall accuracy, it does not perform better than the first KNN model to predict the ‘true Yes’ response nor a random guess.\n"

From 5736ba973dab37c39073cb91ba09ea191a0dee03 Mon Sep 17 00:00:00 2001
From: monzchan <166637673+monzchan@users.noreply.github.com>
Date: Sat, 8 Jun 2024 18:11:06 -0400
Subject: [PATCH 3/3] amend format

---
 02_assignments/assignment_2.ipynb | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)
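
Note on the two confusion tables in this patch: a small sketch that recomputes the rates discussed in the notebook from the four listed counts per model. The helper `confusion_metrics` and its key names (`recall_yes`, `precision_yes`, `share_yes`) are illustrative labels chosen here, not names used in the notebook.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Summary rates for a 2x2 confusion table with 'Yes' as the positive class.

    tn: truth No,  predicted No
    fp: truth No,  predicted Yes
    fn: truth Yes, predicted No
    tp: truth Yes, predicted Yes
    """
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,    # share of all predictions that are correct
        "recall_yes": tp / (tp + fn),     # share of true Yes that is predicted Yes
        "precision_yes": tp / (tp + fp),  # share of predicted Yes that is truly Yes
        "share_yes": (tp + fn) / total,   # share of true Yes in the testing set
    }


# Counts copied from the two confusion tables listed in this patch.
knn1 = confusion_metrics(tn=1315, fp=96, fn=68, tp=12)  # n_neighbors=1
knn2 = confusion_metrics(tn=1343, fp=22, fn=87, tp=5)   # n_neighbors=3

for name, metrics in (("knn1", knn1), ("knn2", knn2)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

For the counts as listed, the overall accuracy comes out at about 0.89 and 0.925, while the share of true ‘Yes’ customers that each model catches is about 0.15 and 0.054.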

diff --git a/02_assignments/assignment_2.ipynb b/02_assignments/assignment_2.ipynb
index a3ec21571..062edc2d6 100644
--- a/02_assignments/assignment_2.ipynb
+++ b/02_assignments/assignment_2.ipynb
@@ -279,13 +279,14 @@
    "id": "443eaaea",
    "metadata": {},
    "source": [
-    "Based on the accuracy rate calculated by considering the true No, true yes, predicted No, and predicted yes, the prediction accuracy rate is 89Í% which is very high. Yet, by looking at the confusion table, the model predicting a ‘Yes’ for a true Yes may not be very accurate.\n",
+    "Counting all four outcomes (true No, true Yes, predicted No, and predicted Yes), the prediction accuracy rate is 89%, which is very high. Yet the confusion table shows that the model is not very accurate at predicting ‘Yes’ for a true Yes.\n",
     " \n",
     "Confusion table:\n",
-    "Truth          No     Yes\n",
-    "Predicted\n",
-    "No            1315     68\n",
-    "Yes            96       12\n",
+    "Truth(No), Predicted(No): 1315\n",
+    "Truth(No), Predicted(Yes): 96\n",
+    "Truth(Yes), Predicted(No): 68\n",
+    "Truth(Yes), Predicted(Yes): 12\n",
+    "\n",
     "\n",
     "The percentage of customer in the Caravan dataset who actually purchase the insurance is 5.977%.\n",
     "The percentage of customer in the testing set who have actually purchased the insurance is (68+12)/(1315+88+96+12)*100% = 4.5% which is comparable to the 5.977% of customer actually purchase in the insurance in the whole Caravan dataset.\n",
@@ -326,11 +327,10 @@
     "The overall accuracy rate for the second KNN model is 92.5%. \n",
     "\n",
     "Confusion table:\n",
-    "Truth          No     Yes\n",
-    "Predicted\n",
-    "No            1343     87\n",
-    "Yes            22       5\n",
-    "\n",
+    "Truth(No), Predicted(No): 1343\n",
+    "Truth(No), Predicted(Yes): 22\n",
+    "Truth(Yes), Predicted(No): 87\n",
+    "Truth(Yes), Predicted(Yes): 5\n",
     "\n",
     "Based on the confusion table, the accuracy of the second KNN model to predict the ‘true Yes’ response is 5/(87+5)*100% = 5.43% which is worse than that for the first KNN model.\n",
     "Through this model has a higher overall accuracy, it does not perform better than the first KNN model to predict the ‘true Yes’ response nor a random guess.\n"