diff --git a/notebooks/Imputation_best_practices/Imputation_best_practices.ipynb b/notebooks/Imputation_best_practices/Imputation_best_practices.ipynb
new file mode 100644
index 0000000..0ecc6f1
--- /dev/null
+++ b/notebooks/Imputation_best_practices/Imputation_best_practices.ipynb
@@ -0,0 +1,4475 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "fce74c70-b998-437d-bd77-43d723d57f13",
+   "metadata": {},
+   "source": [
+    "# Handling Missing Data\n",
+    "One of the first steps in any data science workflow is to understand the dataset and clean it. Real-world datasets are often messy and require significant preprocessing before they can be used for subsequent data science tasks such as feature engineering and model training. One of the tasks within data cleaning is handling missing data. There are several approaches to missing data, such as dropping it, filling with zeros, filling with the mean, and KNN imputation. In this notebook, we will explore two of these imputation techniques and compare their effectiveness on two sample datasets.\n",
+    "\n",
+    "a. The first sample dataset is random numbers: we will generate 1000 random numbers and perform basic KNN and mean imputation.\n",
+    "\n",
+    "b. The second sample dataset is the UCI housing dataset: we will apply mean and KNN imputation to both scaled and unscaled values."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e2ceaeb0-e282-4c63-97e2-f1dd03810aa2",
+   "metadata": {},
+   "source": [
+    "# What to try in this notebook?\n",
+    "\n",
+    "#### 1. Generate a dataset of random numbers, use one column and create missing values (1%, 5%, 10%), scale the values, and apply KNN and mean imputation. Compare the results and compute mean() and var() for the list of differences between the original and imputed values.\n",
+    "\n",
+    "\n",
+    "#### 2. Use a housing dataset from UCI, use one column and create missing values (1%, 5%, 10%), scale the values, and apply KNN and mean imputation. Compare the results and compute mean() and var() for the list of differences between the original and imputed values.\n",
+    "\n",
+    "Dataset - https://raw.githubusercontent.com/SheshNGupta/datasets/main/train.csv"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "d8fe4103-6e71-4b97-810c-b599a0482944",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import warnings\n",
+    "warnings.filterwarnings('ignore')\n",
+    "from sklearn.impute import KNNImputer\n",
+    "from sklearn.preprocessing import MinMaxScaler"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f95427ef-d6bc-47b8-a516-45a05b238180",
+   "metadata": {},
+   "source": [
+    "# 1.1 Random Numbers dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "ae373dd4-26c0-46e8-bdba-dd1d31c77e4e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "random_dataset = pd.DataFrame({'number': np.random.rand(1000)})"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "5ea97930-03cd-48ff-97b9-97e9cd9dde55",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "

|     | number   |
|-----|----------|
| 823 | 0.925249 |
| 266 | 0.077479 |
| 959 | 0.897447 |
| 493 | 0.259423 |
| 768 | 0.193178 |
| 105 | 0.174632 |
| 610 | 0.456349 |
| 824 | 0.688290 |
| 968 | 0.493667 |
| 849 | 0.368834 |

|     | number   | number_copy_1_percent | number_copy_5_percent | number_copy_10_percent |
|-----|----------|-----------------------|-----------------------|------------------------|
| 0   | 0.438564 | 0.438564              | 0.438564              | 0.438564               |
| 1   | 0.836801 | 0.836801              | 0.836801              | 0.836801               |
| 2   | 0.798077 | 0.798077              | 0.798077              | 0.798077               |
| 3   | 0.269161 | 0.269161              | 0.269161              | 0.269161               |
| 4   | 0.830948 | 0.830948              | 0.830948              | 0.830948               |
| ... | ...      | ...                   | ...                   | ...                    |
| 995 | 0.920130 | 0.920130              | 0.920130              | 0.920130               |
| 996 | 0.007397 | 0.007397              | 0.007397              | 0.007397               |
| 997 | 0.163360 | 0.163360              | 0.163360              | 0.163360               |
| 998 | 0.553700 | 0.553700              | 0.553700              | 0.553700               |
| 999 | 0.771442 | 0.771442              | 0.771442              | 0.771442               |

1000 rows × 4 columns
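
The cell that builds the `number_copy_*_percent` columns is not visible in this part of the diff. A minimal sketch of one way to create them, assuming missing values are introduced by setting a random 1%, 5% and 10% of positions to NaN, might look like this. The helper name `mask_fraction` and the fixed seed are hypothetical and not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Seed is an assumption, added only so the sketch is reproducible.
rng = np.random.default_rng(42)
df = pd.DataFrame({"number": rng.random(1000)})


def mask_fraction(values: pd.Series, fraction: float, rng: np.random.Generator) -> pd.Series:
    """Return a float copy of `values` with `fraction` of its entries set to NaN."""
    masked = values.astype(float).copy()
    n_missing = int(round(fraction * len(masked)))
    # Sample positions without replacement so exactly n_missing entries are removed.
    positions = rng.choice(len(masked), size=n_missing, replace=False)
    masked.iloc[positions] = np.nan
    return masked


# One masked copy per missingness level; column names follow the table above.
for pct in (1, 5, 10):
    df[f"number_copy_{pct}_percent"] = mask_fraction(df["number"], pct / 100, rng)

missing_counts = df.isna().sum()
print(missing_counts)
```

Keeping the untouched `number` column next to the masked copies is what makes it possible to measure imputation error later, since the true values stay available.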

|     | number   | number_copy_1_percent | number_copy_5_percent | number_copy_10_percent |
|-----|----------|-----------------------|-----------------------|------------------------|
| 701 | 0.244629 | 0.244629              | 0.244629              | 0.244629               |
| 39  | 0.517202 | 0.517202              | 0.517202              | 0.517202               |
| 335 | 0.100813 | 0.100813              | 0.100813              | 0.100813               |
| 204 | 0.277534 | 0.277534              | 0.277534              | 0.277534               |
| 391 | 0.859032 | 0.859032              | 0.857231              | 0.859032               |
| 203 | 0.252622 | 0.252622              | 0.252622              | 0.252622               |
| 144 | 0.844587 | 0.844587              | 0.844587              | 0.844587               |
| 201 | 0.431603 | 0.431603              | 0.431603              | 0.431603               |
| 749 | 0.848537 | 0.848537              | 0.848537              | 0.848240               |
| 497 | 0.464531 | 0.464531              | 0.464531              | 0.464531               |

|     | number   | number_copy_1_percent | number_copy_5_percent | number_copy_10_percent |
|-----|----------|-----------------------|-----------------------|------------------------|
| 293 | 0.583231 | 0.583231              | 0.583231              | 0.583231               |
| 461 | 0.867035 | 0.867035              | 0.867035              | 0.867035               |
| 875 | 0.676228 | 0.676228              | 0.676228              | 0.676228               |
| 999 | 0.771442 | 0.771442              | 0.771442              | 0.771442               |
| 75  | 0.909050 | 0.909050              | 0.909050              | 0.909050               |
| 98  | 0.629583 | 0.629583              | 0.629583              | 0.629583               |
| 381 | 0.181614 | 0.181614              | 0.181614              | 0.181614               |
| 592 | 0.523109 | 0.523109              | 0.523109              | 0.523109               |
| 155 | 0.038074 | 0.038074              | 0.038074              | 0.038074               |
| 630 | 0.869200 | 0.869200              | 0.869200              | 0.869200               |
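
The imputation cells themselves are not visible here, so the following is only a sketch of how KNN and mean imputation could be compared on one masked column. It assumes `KNNImputer` is fit on a frame that still contains the complete `number` column. The sampled output suggests this (an imputed 0.857231 sits next to an original 0.859032), but the diff does not confirm it. Under that assumption the neighbours are chosen by their `number` value, so the KNN estimate is expected to land very close to the truth. The seed and `n_neighbors=5` are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)  # seed is an assumption for reproducibility
df = pd.DataFrame({"number": rng.random(1000)})

# Mask 5% of a copy of the column.
missing_pos = rng.choice(len(df), size=50, replace=False)
df["number_copy_5_percent"] = df["number"]
df.iloc[missing_pos, df.columns.get_loc("number_copy_5_percent")] = np.nan

# KNN imputation on the whole frame: neighbours are found through the
# complete "number" column (assumption about how the notebook was run).
knn_values = KNNImputer(n_neighbors=5).fit_transform(df)
knn_filled = pd.DataFrame(knn_values, columns=df.columns, index=df.index)

# Mean imputation: every gap receives the mean of the observed values.
masked_col = df["number_copy_5_percent"]
mean_filled = masked_col.fillna(masked_col.mean())

# Absolute error at the positions that were actually masked.
truth = df["number"].iloc[missing_pos]
knn_error = (knn_filled["number_copy_5_percent"].iloc[missing_pos] - truth).abs()
mean_error = (mean_filled.iloc[missing_pos] - truth).abs()

print("KNN  mean/var of differences:", knn_error.mean(), knn_error.var())
print("Mean mean/var of differences:", mean_error.mean(), mean_error.var())
```

In a real workflow the fully observed original column would not be available as a predictor, so an error this small should be read as an upper bound on how well KNN imputation can do, not as a typical result.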
' + table._repr_html_() + ' | ' for table in table_list]) +\n", + " '

|      | Id   | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|------|------|------------|----------|-------------|---------|--------|-------|----------|-------------|-----------|-----|----------|--------|-------|-------------|---------|--------|--------|----------|---------------|-----------|
| 740  | 741  | 70         | RM       | 60.0        | 9600    | Pave   | Grvl  | Reg      | Lvl         | AllPub    | ... | 0        | NaN    | GdPrv | NaN         | 0       | 5      | 2007   | WD       | Abnorml       | 132000    |
| 1209 | 1210 | 20         | RL       | 85.0        | 10182   | Pave   | NaN   | IR1      | Lvl         | AllPub    | ... | 0        | NaN    | NaN   | NaN         | 0       | 5      | 2006   | New      | Partial       | 290000    |
| 64   | 65   | 60         | RL       | NaN         | 9375    | Pave   | NaN   | Reg      | Lvl         | AllPub    | ... | 0        | NaN    | GdPrv | NaN         | 0       | 2      | 2009   | WD       | Normal        | 219500    |
| 208  | 209  | 60         | RL       | NaN         | 14364   | Pave   | NaN   | IR1      | Low         | AllPub    | ... | 0        | NaN    | NaN   | NaN         | 0       | 4      | 2007   | WD       | Normal        | 277000    |
| 436  | 437  | 50         | RM       | 40.0        | 4400    | Pave   | NaN   | Reg      | Lvl         | AllPub    | ... | 0        | NaN    | NaN   | NaN         | 0       | 10     | 2006   | WD       | Normal        | 116000    |
| 19   | 20   | 20         | RL       | 70.0        | 7560    | Pave   | NaN   | Reg      | Lvl         | AllPub    | ... | 0        | NaN    | MnPrv | NaN         | 0       | 5      | 2009   | COD      | Abnorml       | 139000    |
| 1449 | 1450 | 180        | RM       | 21.0        | 1533    | Pave   | NaN   | Reg      | Lvl         | AllPub    | ... | 0        | NaN    | NaN   | NaN         | 0       | 8      | 2006   | WD       | Abnorml       | 92000     |
| 449  | 450  | 50         | RM       | 50.0        | 6000    | Pave   | NaN   | Reg      | Lvl         | AllPub    | ... | 0        | NaN    | NaN   | NaN         | 0       | 6      | 2007   | WD       | Normal        | 120000    |
| 1185 | 1186 | 50         | RL       | 60.0        | 9738    | Pave   | NaN   | Reg      | Lvl         | AllPub    | ... | 0        | NaN    | NaN   | NaN         | 0       | 3      | 2006   | WD       | Normal        | 104900    |
| 1023 | 1024 | 120        | RL       | 43.0        | 3182    | Pave   | NaN   | Reg      | Lvl         | AllPub    | ... | 0        | NaN    | NaN   | NaN         | 0       | 5      | 2008   | WD       | Normal        | 191000    |

10 rows × 81 columns

|   | SalePrice | sp_copy_1_percent | sp_copy_5_percent | sp_copy_10_percent |
|---|-----------|-------------------|-------------------|--------------------|
| 0 | 208500    | 208500            | 208500            | 208500             |
| 1 | 181500    | 181500            | 181500            | 181500             |
| 2 | 223500    | 223500            | 223500            | 223500             |
| 3 | 140000    | 140000            | 140000            | 140000             |
| 4 | 250000    | 250000            | 250000            | 250000             |

|   | SalePrice | sp_copy_1_percent | sp_copy_5_percent | sp_copy_10_percent |
|---|-----------|-------------------|-------------------|--------------------|
| 0 | 0.241078  | 0.241078          | 0.241078          | 0.241078           |
| 1 | 0.203583  | 0.203583          | 0.203583          | 0.203583           |
| 2 | 0.261908  | 0.261908          | 0.261908          | 0.261908           |
| 3 | 0.145952  | 0.145952          | 0.145952          | 0.145952           |
| 4 | 0.298709  | 0.298709          | 0.298709          | 0.298709           |
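
The scaled values above lie in [0, 1], which fits the `MinMaxScaler` imported at the top of the notebook, although the scaling cell itself is not shown. A minimal sketch of min-max scaling a column that contains gaps, assuming the scaler is applied to the whole frame, is given below. The five prices are taken from the head of the table, so the scaled numbers will not match the notebook's, which presumably use the minimum and maximum of the full dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Five prices from the table head; "sp_copy" is a hypothetical masked copy.
prices = pd.DataFrame({"SalePrice": [208500.0, 181500.0, 223500.0, 140000.0, 250000.0]})
prices["sp_copy"] = prices["SalePrice"]
prices.loc[2, "sp_copy"] = np.nan

# MinMaxScaler ignores NaN when fitting and keeps it as NaN when transforming,
# so scaling can be done before imputation.
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(prices), columns=prices.columns, index=prices.index)

# inverse_transform maps scaled (or scaled-and-imputed) values back to prices.
restored = pd.DataFrame(scaler.inverse_transform(scaled), columns=prices.columns, index=prices.index)

print(scaled)
print(restored)
```

Each column is mapped as (x - min) / (max - min), so errors measured on the scaled data are in units of the column's range and cannot be compared directly with errors measured in the original price units.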

|   | SalePrice | sp_copy_1_percent | sp_copy_5_percent | sp_copy_10_percent |
|---|-----------|-------------------|-------------------|--------------------|
| 0 | 208500.0  | 208500.0          | 208500.0          | 208500.0           |
| 1 | 181500.0  | 181500.0          | 181500.0          | 181500.0           |
| 2 | 223500.0  | 223500.0          | 223500.0          | 223500.0           |
| 3 | 140000.0  | 140000.0          | 140000.0          | 140000.0           |
| 4 | 250000.0  | 250000.0          | 250000.0          | 250000.0           |

|   | SalePrice | sp_copy_1_percent | sp_copy_5_percent | sp_copy_10_percent |
|---|-----------|-------------------|-------------------|--------------------|
| 0 | 0.241078  | 0.241078          | 0.241078          | 0.241078           |
| 1 | 0.203583  | 0.203583          | 0.203583          | 0.203583           |
| 2 | 0.261908  | 0.261908          | 0.261908          | 0.261908           |
| 3 | 0.145952  | 0.145952          | 0.145952          | 0.145952           |
| 4 | 0.298709  | 0.298709          | 0.298709          | 0.298709           |

|               | diff. list Mean(KNN) | diff. list Var.(KNN) |
|---------------|----------------------|----------------------|
| 1%_saleprice  | 105.000              | 5.210500e+04         |
| 5%_saleprice  | 163.012              | 4.601896e+04         |
| 10%_saleprice | 470.800              | 3.667553e+06         |

|               | diff. list Mean(KNN) scaled | diff. list Var.(KNN) scaled |
|---------------|-----------------------------|-----------------------------|
| 1%_saleprice  | 0.000000                    | 0.000000e+00                |
| 5%_saleprice  | 0.000012                    | 7.654124e-09                |
| 10%_saleprice | 0.000265                    | 2.973842e-06                |

|     | SalePrice | sp_copy_1_percent | sp_copy_5_percent | sp_copy_10_percent |
|-----|-----------|-------------------|-------------------|--------------------|
| 436 | 116000    | 116000.0          | 116000.0          | 116000.000000      |
| 21  | 139400    | 139400.0          | 139400.0          | 139400.000000      |
| 618 | 314813    | 314813.0          | 314813.0          | 314813.000000      |
| 207 | 141000    | 141000.0          | 141000.0          | 182369.783333      |
| 366 | 159000    | 159000.0          | 159000.0          | 159000.000000      |

|     | SalePrice | sp_copy_1_percent | sp_copy_5_percent | sp_copy_10_percent |
|-----|-----------|-------------------|-------------------|--------------------|
| 457 | 0.307041  | 0.307041          | 0.307041          | 0.201890           |
| 876 | 0.135190  | 0.135190          | 0.135190          | 0.135190           |
| 361 | 0.152895  | 0.152895          | 0.152895          | 0.152895           |
| 682 | 0.191779  | 0.191779          | 0.191779          | 0.201890           |
| 523 | 0.208096  | 0.208096          | 0.208096          | 0.208096           |

|               | diff. list Mean(MI) | diff. list Var.(MI) |
|---------------|---------------------|---------------------|
| 1%_saleprice  | 47198.616970        | 6.345466e+08        |
| 5%_saleprice  | 54438.206863        | 1.768876e+09        |
| 10%_saleprice | 58045.636667        | 2.875291e+09        |

|                      | diff. list Mean(MI) scaled | diff. list Var.(MI) scaled |
|----------------------|----------------------------|----------------------------|
| 1%_saleprice_scaled  | 0.000000                   | 0.000000                   |
| 5%_saleprice_scaled  | 0.001618                   | 0.000056                   |
| 10%_saleprice_scaled | 0.018922                   | 0.004251                   |
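
The summary tables report a mean and a variance for the "list of differences" between original and imputed values. The computation is not visible in this part of the diff, so the sketch below rests on assumptions: differences are taken as absolute values, only the positions that were masked are included, and the variance uses `ddof=0`. The helper `difference_stats` and the toy numbers are hypothetical.

```python
import pandas as pd


def difference_stats(original: pd.Series, imputed: pd.Series, was_missing: pd.Series, ddof: int = 0):
    """Mean and variance of |original - imputed| over the masked positions only."""
    diffs = (original[was_missing] - imputed[was_missing]).abs()
    return diffs.mean(), diffs.var(ddof=ddof)


# Toy data: positions 1 and 3 were masked and then imputed.
original = pd.Series([100.0, 200.0, 300.0, 400.0])
imputed = pd.Series([100.0, 250.0, 300.0, 370.0])
was_missing = pd.Series([False, True, False, True])

mean_diff, var_diff = difference_stats(original, imputed, was_missing)
print(mean_diff, var_diff)  # differences are 50 and 30
```

The choices matter when reading the tables. Including unmasked rows would pull both statistics toward zero, and pandas defaults to `ddof=1` while NumPy defaults to `ddof=0`, so the same list of differences can yield different variances depending on which library computes them.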