diff --git a/02_activities/assignments/assignment_1.ipynb b/02_activities/assignments/assignment_1.ipynb index 828092657..be4afd8c0 100644 --- a/02_activities/assignments/assignment_1.ipynb +++ b/02_activities/assignments/assignment_1.ipynb @@ -96,7 +96,7 @@ "metadata": {}, "outputs": [], "source": [ - "# Your answer here" + "wine_df.shape[0]" ] }, { @@ -114,7 +114,7 @@ "metadata": {}, "outputs": [], "source": [ - "# Your answer here" + "wine_df.shape[1]" ] }, { @@ -132,7 +132,8 @@ "metadata": {}, "outputs": [], "source": [ - "# Your answer here" + "print(wine_df['class'].dtype)\n", + "wine_df['class'].unique()" ] }, { @@ -151,7 +152,7 @@ "metadata": {}, "outputs": [], "source": [ - "# Your answer here" + "wine_df.shape[1]-1" ] }, { @@ -204,7 +205,7 @@ "id": "403ef0bb", "metadata": {}, "source": [ - "> Your answer here..." + "It is important to standardize predictor variables because it puts them on the same scale, which helps models work better. For example, if one variable ranges from 0–1 and another from 1–10,000, the larger-scaled variable can unfairly dominate the model. Standardization also makes effects easier to compare: in linear models, it lets you see which variables matter more. In addition, models like logistic regression or neural networks learn faster and more reliably when variables are standardized, and distance-based algorithms like KNN or SVM give skewed results when variables with large scales are left unstandardized." ] }, { @@ -220,7 +221,7 @@ "id": "fdee5a15", "metadata": {}, "source": [ - "> Your answer here..." + "The response variable in our case is categorical (e.g., discrete labels like 0, 1, 2), and standardization is only meaningful for continuous numerical data. Standardizing class labels would distort their categorical meaning and interfere with classification algorithms, which rely on clearly defined, separate classes. For example, standardizing labels like 0, 1, and 2 would transform them into continuous values, removing the distinct class boundaries and confusing the model. More generally, even in cases where the response variable is continuous (e.g., blood pressure or probability of disease), standardization is usually avoided unless there's a specific modeling reason. This is because the response variable represents a real-world quantity we want to predict as it exists, and standardizing it would change its scale and interpretation, forcing predictions to be expressed in terms of standard deviations rather than meaningful units." ] }, { @@ -236,7 +237,7 @@ "id": "f0676c21", "metadata": {}, "source": [ - "> Your answer here..." + "It is very important because setting a random seed ensures that random processes are reproducible. Many data science and machine learning tasks involve randomness. Without setting a seed, these processes would yield different results each time you run the code, making it hard to replicate results or debug models. The specific seed value is not inherently important; any integer will work. What matters is that the same seed produces the same sequence of random numbers, ensuring consistency and reproducibility. Different seeds will lead to different results, so choosing a fixed seed allows others to replicate your results exactly." ] }, { @@ -261,7 +262,9 @@ "\n", "# split the data into a training and testing set. hint: use train_test_split !\n", "\n", - "# Your code here ..."
+ "response=wine_df['class']\n", + "\n", + "X_train, X_test, y_tarin, y_test=train_test_split(predictors_standardized, response, test_size=0.25, random_state=123)" ] }, { @@ -289,7 +292,20 @@ "metadata": {}, "outputs": [], "source": [ - "# Your code here..." + "#1. Initialize the KNN classifier\n", + "knn = KNeighborsClassifier()\n", + "\n", + "#2. Define the parameter grid: n_neighbors from 1 to 50\n", + "param_grid = {'n_neighbors': list(range(1, 51))}\n", + "\n", + "#3. Set up GridSearchCV with 10-fold cross-validation\n", + "grid_search = GridSearchCV(knn, param_grid, cv=10)\n", + "\n", + "# Fit the model using the training data\n", + "grid_search.fit(X_train, y_train)\n", + "\n", + "#4. Return the best value for n_neighbors\n", + "grid_search.best_params_['n_neighbors'] code here..." ] }, { @@ -310,7 +326,21 @@ "metadata": {}, "outputs": [], "source": [ - "# Your code here..." + "# Retrieve the best k\n", + "best_k = grid_search.best_params_['n_neighbors']\n", + "\n", + "# Initialize the final model with the best number of neighbors\n", + "final_knn = KNeighborsClassifier(n_neighbors=best_k)\n", + "\n", + "# Fit the model on the training data\n", + "final_knn.fit(X_train, y_train)\n", + "\n", + "# Predict on the test set\n", + "y_pred = final_knn.predict(X_test)\n", + "\n", + "# Evaluate accuracy\n", + "accuracy = accuracy_score(y_test, y_pred)\n", + "accuracy" ] }, { @@ -365,7 +395,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3.10.4", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -379,7 +409,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.19" + "version": "3.12.4" }, "vscode": { "interpreter": {