From 61ba300b95119ebdec81fd840e4dc2286cf0a564 Mon Sep 17 00:00:00 2001 From: Shesh Narayan Gupta <91396937+SheshNGupta@users.noreply.github.com> Date: Mon, 6 Jun 2022 18:46:17 -0400 Subject: [PATCH 1/8] Adding imputation best practices notebook This notebook demonstrates how to use imputation techniques on missing data --- notebooks/Imputation_best_practices.ipynb | 4557 +++++++++++++++++++++ notebooks/random_numbers_1000.csv | 1001 +++++ 2 files changed, 5558 insertions(+) create mode 100644 notebooks/Imputation_best_practices.ipynb create mode 100644 notebooks/random_numbers_1000.csv diff --git a/notebooks/Imputation_best_practices.ipynb b/notebooks/Imputation_best_practices.ipynb new file mode 100644 index 0000000..87d582d --- /dev/null +++ b/notebooks/Imputation_best_practices.ipynb @@ -0,0 +1,4557 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "e2ceaeb0-e282-4c63-97e2-f1dd03810aa2", + "metadata": {}, + "source": [ + "# What to try in this notebook?\n", + "\n", + "#### 1. Get a randomly generated dataset from Kaggle, take one column, create missing values (1%, 5%, 10%), scale the values, and apply KNN and mean imputation. Compare the results and compute mean() and var() of the list of differences between the original and imputed values \n", + "\n", + "Dataset - https://www.kaggle.com/timoboz/random-numbers\n", + "\n", + "#### 2. Use a housing dataset from UCI, take one column, create missing values (1%, 5%, 10%), scale the values, and apply KNN and mean imputation. Compare the results and compute mean() and var() of the list of differences between the original
and Imputed value \n", + "\n", + "Dataset - https://github.com/nikbearbrown/AI_Research_Group/blob/main/Awesome-UCI-Datasets/Classification/House_Price_predication/train.csv" + ] + }, + { + "cell_type": "code", + "execution_count": 101, + "id": "d8fe4103-6e71-4b97-810c-b599a0482944", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "from sklearn.impute import KNNImputer\n", + "from sklearn.preprocessing import MinMaxScaler" + ] + }, + { + "cell_type": "markdown", + "id": "f95427ef-d6bc-47b8-a516-45a05b238180", + "metadata": {}, + "source": [ + "# 1.1 Random Numbers dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 102, + "id": "03fc0415-cdd2-415b-a273-08037b06afcf", + "metadata": {}, + "outputs": [], + "source": [ + "random_dataset = pd.read_csv('random_numbers_1000.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": 103, + "id": "5ea97930-03cd-48ff-97b9-97e9cd9dde55", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
     Unnamed: 0    number
782         782  0.955151
378         378  0.310217
542         542  0.607177
80           80  0.861696
282         282  0.204316
976         976  0.059688
924         924  0.372837
329         329  0.406915
131         131  0.402420
607         607  0.078909
\n", + "
" + ], + "text/plain": [ + " Unnamed: 0 number\n", + "782 782 0.955151\n", + "378 378 0.310217\n", + "542 542 0.607177\n", + "80 80 0.861696\n", + "282 282 0.204316\n", + "976 976 0.059688\n", + "924 924 0.372837\n", + "329 329 0.406915\n", + "131 131 0.402420\n", + "607 607 0.078909" + ] + }, + "execution_count": 103, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "random_dataset.sample(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 104, + "id": "f19e199b-91aa-4e03-9e07-37f5a574d481", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 1000 entries, 0 to 999\n", + "Data columns (total 2 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 Unnamed: 0 1000 non-null int64 \n", + " 1 number 1000 non-null float64\n", + "dtypes: float64(1), int64(1)\n", + "memory usage: 15.8 KB\n" + ] + } + ], + "source": [ + "random_dataset.info()" + ] + }, + { + "cell_type": "code", + "execution_count": 105, + "id": "382f0f03-b3f4-4244-a95c-e78476fae2ca", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "count 1000.000000\n", + "mean 0.490463\n", + "std 0.284669\n", + "min 0.000068\n", + "25% 0.252124\n", + "50% 0.479825\n", + "75% 0.735584\n", + "max 0.997610\n", + "Name: number, dtype: float64" + ] + }, + "execution_count": 105, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "random_dataset['number'].describe()" + ] + }, + { + "cell_type": "markdown", + "id": "348a0b85-c450-4d5d-a9d2-c57c95964b42", + "metadata": {}, + "source": [ + "#### Create 3 col. for numbers for 1%, 5% and 10% missing data" + ] + }, + { + "cell_type": "code", + "execution_count": 106, + "id": "f5de26b3-17b7-463b-98e4-147a457ca37e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
       number  number_copy_1_percent  number_copy_5_percent  number_copy_10_percent
0    0.144616               0.144616               0.144616                0.144616
1    0.077515               0.077515               0.077515                0.077515
2    0.155933               0.155933               0.155933                0.155933
3    0.097209               0.097209               0.097209                0.097209
4    0.323750               0.323750               0.323750                0.323750
...       ...                    ...                    ...                     ...
995  0.182107               0.182107               0.182107                0.182107
996  0.787988               0.787988               0.787988                0.787988
997  0.148707               0.148707               0.148707                0.148707
998  0.153121               0.153121               0.153121                0.153121
999  0.474737               0.474737               0.474737                0.474737
\n", + "

1000 rows × 4 columns

\n", + "
" + ], + "text/plain": [ + " number number_copy_1_percent number_copy_5_percent \\\n", + "0 0.144616 0.144616 0.144616 \n", + "1 0.077515 0.077515 0.077515 \n", + "2 0.155933 0.155933 0.155933 \n", + "3 0.097209 0.097209 0.097209 \n", + "4 0.323750 0.323750 0.323750 \n", + ".. ... ... ... \n", + "995 0.182107 0.182107 0.182107 \n", + "996 0.787988 0.787988 0.787988 \n", + "997 0.148707 0.148707 0.148707 \n", + "998 0.153121 0.153121 0.153121 \n", + "999 0.474737 0.474737 0.474737 \n", + "\n", + " number_copy_10_percent \n", + "0 0.144616 \n", + "1 0.077515 \n", + "2 0.155933 \n", + "3 0.097209 \n", + "4 0.323750 \n", + ".. ... \n", + "995 0.182107 \n", + "996 0.787988 \n", + "997 0.148707 \n", + "998 0.153121 \n", + "999 0.474737 \n", + "\n", + "[1000 rows x 4 columns]" + ] + }, + "execution_count": 106, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_number = random_dataset[['number']].copy()\n", + "df_number['number_copy_1_percent'] = df_number['number']\n", + "df_number['number_copy_5_percent'] = df_number['number']\n", + "df_number['number_copy_10_percent'] = df_number['number']\n", + "df_number" + ] + }, + { + "cell_type": "markdown", + "id": "1ff95002-46a0-454b-97c1-6c189153d459", + "metadata": {}, + "source": [ + "#### Check % missing values in this dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 107, + "id": "35c38775-26d9-4b1e-97a9-4c46c0d5d92b", + "metadata": {}, + "outputs": [], + "source": [ + "def get_percent_missing(dataframe):\n", + "    \n", + "    percent_missing = dataframe.isnull().sum() * 100 / len(dataframe)\n", + "    missing_value_df = pd.DataFrame({'column_name': dataframe.columns,\n", + "                                     'percent_missing': percent_missing})\n", + "    return missing_value_df" + ] + }, + { + "cell_type": "code", + "execution_count": 108, + "id": "6837b7e5-4444-4914-9c0e-a9cefd2c7b6f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + 
"number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 0.0\n", + "number_copy_5_percent number_copy_5_percent 0.0\n", + "number_copy_10_percent number_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_number))" + ] + }, + { + "cell_type": "markdown", + "id": "25318ebf-b1bf-4f4b-ba1d-011b27a27f39", + "metadata": {}, + "source": [ + "#### Helper function to create missing values" + ] + }, + { + "cell_type": "code", + "execution_count": 109, + "id": "76da9076-d9c8-417e-bcfc-8ce7066d1a53", + "metadata": {}, + "outputs": [], + "source": [ + "def create_missing(dataframe, percent, col):\n", + "    dataframe.loc[dataframe.sample(frac = percent).index, col] = np.nan" + ] + }, + { + "cell_type": "markdown", + "id": "9dc43e57-be39-4efe-8131-d6a3423b8d77", + "metadata": {}, + "source": [ + "#### Create missing data in each col" + ] + }, + { + "cell_type": "code", + "execution_count": 110, + "id": "6e8ab693-6043-4ade-b62a-9b3fc9ebf735", + "metadata": {}, + "outputs": [], + "source": [ + "create_missing(df_number, 0.01, 'number_copy_1_percent')\n", + "create_missing(df_number, 0.05, 'number_copy_5_percent')\n", + "create_missing(df_number, 0.1, 'number_copy_10_percent')" + ] + }, + { + "cell_type": "markdown", + "id": "655cb92a-6b63-4498-9c31-d63f11145569", + "metadata": {}, + "source": [ + "#### Check % missing after removing data" + ] + }, + { + "cell_type": "code", + "execution_count": 111, + "id": "412518b5-67ec-4a5a-9720-4a0ce7657d44", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 1.0\n", + "number_copy_5_percent number_copy_5_percent 5.0\n", + "number_copy_10_percent number_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_number))" + ] + }, + { + "cell_type": "markdown", + "id": "6876e3fc-b878-4560-a3a4-72c36f2a422e", + "metadata": {}, + "source": [ + 
"#### Store the indices of missing rows" + ] + }, + { + "cell_type": "code", + "execution_count": 112, + "id": "c1860270-add6-4963-9aef-27ef1e171fca", + "metadata": {}, + "outputs": [], + "source": [ + "# Store the positions of NaN values in each column\n", + "number_1_idx = list(np.where(df_number['number_copy_1_percent'].isna())[0])\n", + "number_5_idx = list(np.where(df_number['number_copy_5_percent'].isna())[0])\n", + "number_10_idx = list(np.where(df_number['number_copy_10_percent'].isna())[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 113, + "id": "57841da6-b453-40cc-8ecc-702fe4613a74", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Length of number_1_idx is 10 and it contains 1.0% of total data in column | Total rows: 1000\n", + "Length of number_5_idx is 50 and it contains 5.0% of total data in column | Total rows: 1000\n", + "Length of number_10_idx is 100 and it contains 10.0% of total data in column | Total rows: 1000\n" + ] + } + ], + "source": [ + "print(f\"Length of number_1_idx is {len(number_1_idx)} and it contains {(len(number_1_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")\n", + "print(f\"Length of number_5_idx is {len(number_5_idx)} and it contains {(len(number_5_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")\n", + "print(f\"Length of number_10_idx is {len(number_10_idx)} and it contains {(len(number_10_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")" + ] + }, + { + "cell_type": "markdown", + "id": "47469d0b-a8f3-4469-b18c-3a457f7dc373", + "metadata": {}, + "source": [ + "### Perform KNN impute to df_number dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 114, + "id": "b09c6c85-4ce3-4aeb-bb81-6a698494a58e", 
+ "metadata": {}, + "outputs": [], + "source": [ + "df_number1 = df_number.copy(deep=True)\n", + "imputer = KNNImputer(n_neighbors=5)\n", + "imputed_number_df = pd.DataFrame(imputer.fit_transform(df_number1), columns = df_number1.columns)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 115, + "id": "2f051a7d-3ebd-4839-aae0-ef125944d613", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
       number  number_copy_1_percent  number_copy_5_percent  number_copy_10_percent
347  0.372389               0.372389               0.372389                0.372389
934  0.327766               0.327766               0.327766                0.327766
927  0.753892               0.753892               0.753892                0.753892
997  0.148707               0.148707               0.148707                0.148707
167  0.730901               0.730901               0.730901                0.730901
914  0.841330               0.841330               0.841330                0.841330
432  0.897466               0.897466               0.897466                0.897466
587  0.411685               0.411685               0.411685                0.411685
884  0.378794               0.378794               0.378794                0.378794
379  0.265429               0.265429               0.265429                0.264843
\n", + "
" + ], + "text/plain": [ + " number number_copy_1_percent number_copy_5_percent \\\n", + "347 0.372389 0.372389 0.372389 \n", + "934 0.327766 0.327766 0.327766 \n", + "927 0.753892 0.753892 0.753892 \n", + "997 0.148707 0.148707 0.148707 \n", + "167 0.730901 0.730901 0.730901 \n", + "914 0.841330 0.841330 0.841330 \n", + "432 0.897466 0.897466 0.897466 \n", + "587 0.411685 0.411685 0.411685 \n", + "884 0.378794 0.378794 0.378794 \n", + "379 0.265429 0.265429 0.265429 \n", + "\n", + " number_copy_10_percent \n", + "347 0.372389 \n", + "934 0.327766 \n", + "927 0.753892 \n", + "997 0.148707 \n", + "167 0.730901 \n", + "914 0.841330 \n", + "432 0.897466 \n", + "587 0.411685 \n", + "884 0.378794 \n", + "379 0.264843 " + ] + }, + "execution_count": 115, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "imputed_number_df.sample(10)" + ] + }, + { + "cell_type": "markdown", + "id": "ddc79a45-bd2b-44f3-a3c4-aaefa73b43d9", + "metadata": {}, + "source": [ + "#### Check the % missing data in dataframe now" + ] + }, + { + "cell_type": "code", + "execution_count": 116, + "id": "5c98d450-bf5a-46e5-9091-c6a1202a2611", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 0.0\n", + "number_copy_5_percent number_copy_5_percent 0.0\n", + "number_copy_10_percent number_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(imputed_number_df))" + ] + }, + { + "cell_type": "markdown", + "id": "f14476bf-29e6-4d9a-9cd4-9dd56a53b466", + "metadata": {}, + "source": [ + "#### Store the list of differences between org. 
and imputed value" + ] + }, + { + "cell_type": "code", + "execution_count": 117, + "id": "3f096800-dc6e-4455-a9e6-2db18884e5ee", + "metadata": {}, + "outputs": [], + "source": [ + "# Create lists of absolute differences between imputed and original values\n", + "\n", + "number_diff_1 = []\n", + "number_diff_5 = []\n", + "number_diff_10 = []\n", + "count = 0\n", + "\n", + "for i in number_1_idx:\n", + "    count +=1\n", + "    diff1 = abs(imputed_number_df['number_copy_1_percent'][i] - df_number1['number'][i])\n", + "    number_diff_1.append(diff1)\n", + "    \n", + "\n", + "for i in number_5_idx:\n", + "    diff5 = abs(imputed_number_df['number_copy_5_percent'][i] - df_number1['number'][i])\n", + "    number_diff_5.append(diff5)\n", + "\n", + "for i in number_10_idx:\n", + "    diff10 = abs(imputed_number_df['number_copy_10_percent'][i] - df_number1['number'][i])\n", + "    number_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 118, + "id": "4a2c29fc-99f3-4624-808e-437d3983cabb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(number_diff_1))\n", + "print(len(number_diff_5))\n", + "print(len(number_diff_10))" + ] + }, + { + "cell_type": "markdown", + "id": "4ec4adbe-5571-40e3-90ba-92cb431161ca", + "metadata": {}, + "source": [ + "### Calculate the mean and variance of the list of differences - KNN" + ] + }, + { + "cell_type": "code", + "execution_count": 119, + "id": "1163cb62-9dc4-427e-b5cf-20bf3e16d79b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 0.0007902710470742466 and variance 1% is 4.5687016451605466e-07\n", + "The mean of 5% is 0.000675654857997236 and variance 5% is 3.072444468179742e-07\n", + "The mean of 10% is 0.000675654857997236 and variance 10% is 2.480608628449602e-07\n" + ] + } + ], + "source": [ + "m1 = sum(number_diff_1) / len(number_diff_1)\n", + "\n", + "# calculate 
variance using a generator expression\n", + "var_res1 = sum((xi - m1) ** 2 for xi in number_diff_1) / len(number_diff_1)\n", + "\n", + "m5 = sum(number_diff_5) / len(number_diff_5)\n", + "\n", + "# calculate variance using a generator expression\n", + "var_res5 = sum((xii - m5) ** 2 for xii in number_diff_5) / len(number_diff_5)\n", + "\n", + "\n", + "m10 = sum(number_diff_10) / len(number_diff_10)\n", + "\n", + "# calculate variance using a generator expression\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in number_diff_10) / len(number_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 120, + "id": "6987d059-7449-44a0-a3c2-8605362a18a0", + "metadata": {}, + "outputs": [], + "source": [ + "df_knn_number = pd.DataFrame.from_dict({'1%_number': [m1, var_res1],\n", + " '5%_number': [m5, var_res5],\n", + " '10%_number': [m10, var_res10]}, orient='index')\n", + "df_knn_number.columns=['diff. list Mean(KNN)', 'diff. 
list Var.(KNN)']" + ] + }, + { + "cell_type": "markdown", + "id": "41740e20-5dae-403e-a83b-94c91469fcc3", + "metadata": {}, + "source": [ + "### Perform MEAN based imputation" + ] + }, + { + "cell_type": "markdown", + "id": "17b69478-e97c-41b9-828a-eefbb46eb161", + "metadata": {}, + "source": [ + "#### Before mean imputation % missing" + ] + }, + { + "cell_type": "code", + "execution_count": 121, + "id": "5a828216-8f1a-4157-8141-77e6c929f57a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 1.0\n", + "number_copy_5_percent number_copy_5_percent 5.0\n", + "number_copy_10_percent number_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "df_number2 = df_number.copy(deep=True)\n", + "print(get_percent_missing(df_number2))" + ] + }, + { + "cell_type": "code", + "execution_count": 122, + "id": "1e137676-9f01-44b9-8a84-50d03a89436b", + "metadata": {}, + "outputs": [], + "source": [ + "df_number2['number_copy_1_percent'] = df_number2['number_copy_1_percent'].fillna(df_number2['number_copy_1_percent'].mean())\n", + "df_number2['number_copy_5_percent'] = df_number2['number_copy_5_percent'].fillna(df_number2['number_copy_5_percent'].mean())\n", + "df_number2['number_copy_10_percent'] = df_number2['number_copy_10_percent'].fillna(df_number2['number_copy_10_percent'].mean())" + ] + }, + { + "cell_type": "markdown", + "id": "8da82021-d96a-46ac-81df-035977cb5497", + "metadata": {}, + "source": [ + "#### After mean impute % missing " + ] + }, + { + "cell_type": "code", + "execution_count": 123, + "id": "669c14bd-f920-47db-8476-1cd1b4f4f5bb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 0.0\n", + "number_copy_5_percent number_copy_5_percent 0.0\n", + 
"number_copy_10_percent number_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_number2))" + ] + }, + { + "cell_type": "code", + "execution_count": 124, + "id": "ccb60d18-b24e-4211-9947-46ee0bcc06fe", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
       number  number_copy_1_percent  number_copy_5_percent  number_copy_10_percent
366  0.425525               0.425525               0.425525                0.425525
145  0.246589               0.246589               0.246589                0.246589
538  0.503701               0.503701               0.503701                0.503701
256  0.118901               0.118901               0.491932                0.118901
156  0.773215               0.773215               0.773215                0.773215
500  0.441087               0.441087               0.441087                0.441087
325  0.095068               0.095068               0.095068                0.095068
97   0.209842               0.209842               0.209842                0.487348
905  0.117657               0.491084               0.117657                0.117657
251  0.961305               0.961305               0.961305                0.961305
\n", + "
" + ], + "text/plain": [ + " number number_copy_1_percent number_copy_5_percent \\\n", + "366 0.425525 0.425525 0.425525 \n", + "145 0.246589 0.246589 0.246589 \n", + "538 0.503701 0.503701 0.503701 \n", + "256 0.118901 0.118901 0.491932 \n", + "156 0.773215 0.773215 0.773215 \n", + "500 0.441087 0.441087 0.441087 \n", + "325 0.095068 0.095068 0.095068 \n", + "97 0.209842 0.209842 0.209842 \n", + "905 0.117657 0.491084 0.117657 \n", + "251 0.961305 0.961305 0.961305 \n", + "\n", + " number_copy_10_percent \n", + "366 0.425525 \n", + "145 0.246589 \n", + "538 0.503701 \n", + "256 0.118901 \n", + "156 0.773215 \n", + "500 0.441087 \n", + "325 0.095068 \n", + "97 0.487348 \n", + "905 0.117657 \n", + "251 0.961305 " + ] + }, + "execution_count": 124, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_number2.sample(10)" + ] + }, + { + "cell_type": "markdown", + "id": "88d89795-0ae9-4f37-89cd-b24d36658588", + "metadata": {}, + "source": [ + "#### Create a list of differences - MEAN" + ] + }, + { + "cell_type": "code", + "execution_count": 125, + "id": "530979d5-52c4-473d-95f3-754c460a7ab6", + "metadata": {}, + "outputs": [], + "source": [ + "# Create lists of absolute differences between imputed and original values\n", + "\n", + "number_diff_1_mean = []\n", + "number_diff_5_mean = []\n", + "number_diff_10_mean = []\n", + "count = 0\n", + "\n", + "for i in number_1_idx:\n", + "    count +=1\n", + "    diff1 = abs(df_number2['number_copy_1_percent'][i] - df_number2['number'][i])\n", + "    number_diff_1_mean.append(diff1)\n", + "    \n", + "\n", + "for i in number_5_idx:\n", + "    diff5 = abs(df_number2['number_copy_5_percent'][i] - df_number2['number'][i])\n", + "    number_diff_5_mean.append(diff5)\n", + "\n", + "for i in number_10_idx:\n", + "    diff10 = abs(df_number2['number_copy_10_percent'][i] - df_number2['number'][i])\n", + "    number_diff_10_mean.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 126, + "id": 
"28dd2494-0175-431e-b4b7-09ee4af1f6a0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(number_diff_1_mean))\n", + "print(len(number_diff_5_mean))\n", + "print(len(number_diff_10_mean))" + ] + }, + { + "cell_type": "markdown", + "id": "4e90251e-4c0a-4e2d-82b1-8764374aed1c", + "metadata": {}, + "source": [ + "### Calculate the mean and variance of the list of differences - MEAN Impute" + ] + }, + { + "cell_type": "code", + "execution_count": 127, + "id": "682bd76e-4875-4b4d-b90b-91d8a6e492ae", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 0.269368727544059 and variance 1% is 0.018130331928686818\n", + "The mean of 5% is 0.18484105170274112 and variance 5% is 0.014920933643125705\n", + "The mean of 10% is 0.18484105170274112 and variance 10% is 0.020023889816061954\n" + ] + } + ], + "source": [ + "m1 = sum(number_diff_1_mean) / len(number_diff_1_mean)\n", + "\n", + "# calculate variance using a generator expression\n", + "var_res1 = sum((xi - m1) ** 2 for xi in number_diff_1_mean) / len(number_diff_1_mean)\n", + "\n", + "m5 = sum(number_diff_5_mean) / len(number_diff_5_mean)\n", + "\n", + "# calculate variance using a generator expression\n", + "var_res5 = sum((xii - m5) ** 2 for xii in number_diff_5_mean) / len(number_diff_5_mean)\n", + "\n", + "\n", + "m10 = sum(number_diff_10_mean) / len(number_diff_10_mean)\n", + "\n", + "# calculate variance using a generator expression\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in number_diff_10_mean) / len(number_diff_10_mean)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 128, + "id": 
"1f41880d-3e7d-48c9-8744-7e47ccae3c17", + "metadata": {}, + "outputs": [], + "source": [ + "df_MI_number = pd.DataFrame.from_dict({'1%_number': [m1, var_res1],\n", + " '5%_number': [m5, var_res5],\n", + " '10%_number': [m10, var_res10]}, orient='index')\n", + "df_MI_number.columns=['diff. list Mean(MI)', 'diff. list Var.(MI)']" + ] + }, + { + "cell_type": "markdown", + "id": "ec64b079-db97-429c-ae3a-519eec91db3f", + "metadata": {}, + "source": [ + "## KNN and MEAN columns side by side" + ] + }, + { + "cell_type": "code", + "execution_count": 129, + "id": "d74b0e73-e3f0-4107-806d-c5d5a50aab9a", + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import display_html\n", + "from itertools import chain,cycle\n", + "def display_side_by_side(*args,titles=cycle([''])):\n", + "    html_str=''\n", + "    for df,title in zip(args, chain(titles,cycle(['</br>'])) ):\n", + "        html_str+='<th style=\"text-align:center\"><td style=\"vertical-align:top\">'\n", + "        html_str+=f'<h2>{title}</h2>'\n", + "        html_str+=df.to_html().replace('table','table style=\"display:inline\"')\n", + "        html_str+='</td></th>'\n", + "    display_html(html_str,raw=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 130, + "id": "747a487f-cbc4-467a-9bc7-b0856dbb6576", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 130, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from IPython.display import display, HTML\n", + "\n", + "CSS = \"\"\"\n", + ".output {\n", + "    flex-direction: row;\n", + "}\n", + "\"\"\"\n", + "\n", + "HTML('<style>{}</style>'.format(CSS))" + ] + }, + { + "cell_type": "code", + "execution_count": 131, + "id": "d24551d1-cd58-4a41-8262-873fe5034272", + "metadata": {}, + "outputs": [], + "source": [ + "# https://github.com/epmoyer/ipy_table/issues/24\n", + "\n", + "from IPython.core.display import HTML\n", + "\n", + "def multi_table(table_list):\n", + "    ''' Accepts a list of IpyTable objects and returns a table which contains each IpyTable in a cell\n", + "    '''\n", + "    return HTML(\n", + "        '<table><tr style=\"background-color:white;\">' + \n", + "        ''.join(['<td>' + table._repr_html_() + '</td>' for table in table_list]) +\n", + "        '</tr></table>'\n", + "    )" + ] + }, + { + "cell_type": "code", + "execution_count": 132, + "id": "8a8daa30-3abf-4315-ae58-f9171ff000d5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[124, 257, 309, 313, 405]\n" + ] + } + ], + "source": [ + "print(number_1_idx[:5])" + ] + }, + { + "cell_type": "code", + "execution_count": 133, + "id": "da6b1646-2417-42b7-bc8f-d3b0be85c61b", + "metadata": {}, + "outputs": [], + "source": [ + "compare_1 = imputed_number_df.loc[:, [\"number\", \"number_copy_1_percent\"]]\n", + "compare_5 = imputed_number_df.loc[:, [\"number\", \"number_copy_5_percent\"]]\n", + "compare_10 = imputed_number_df.loc[:, [\"number\", \"number_copy_10_percent\"]]" + ] + }, + { + "cell_type": "code", + "execution_count": 134, + "id": "380b94cf-264f-4a41-bb1d-ac272354073f", + "metadata": {}, + "outputs": [], + "source": [ + "compare_1_df = compare_1.iloc[number_1_idx]\n", + "compare_5_df = compare_5.iloc[number_5_idx]\n", + "compare_10_df = compare_10.iloc[number_10_idx]" + ] + }, + { + "cell_type": "code", + "execution_count": 135, + "id": "e5b21e71-0ddd-4c60-b931-b384d65230dd", + "metadata": {}, + "outputs": [], + "source": [ + "compare_1_mean = df_number2.loc[:, [\"number\", \"number_copy_1_percent\"]]\n", + "compare_5_mean = df_number2.loc[:, [\"number\", \"number_copy_5_percent\"]]\n", + "compare_10_mean = df_number2.loc[:, [\"number\", \"number_copy_10_percent\"]]" + ] + }, + { + "cell_type": "code", + "execution_count": 136, + "id": "29be3554-8129-4f0c-bad6-1270b7c6c05b", + "metadata": {}, + "outputs": [], + "source": [ + "compare_1_mean_df = compare_1_mean.iloc[number_1_idx]\n", + "compare_5_mean_df = compare_5_mean.iloc[number_5_idx]\n", + "compare_10_mean_df = compare_10_mean.iloc[number_10_idx]" + ] + }, + { + "cell_type": "code", + "execution_count": 137, + "id": "27b96ecc-3566-48f5-bec5-9b073c575cb6", + "metadata": {}, + "outputs": [], + "source": [ + "# display_side_by_side(compare_1_df.head(), 
compare_1_mean_df.head(), titles=['number 1% KNN Impute','number 1% Mean Impute'])\n", + "# display_side_by_side(compare_5_df.head(), compare_5_mean_df.head(), titles=['number 5% KNN Impute','number 5% Mean Impute'])\n", + "# display_side_by_side(compare_10_df.head(), compare_10_mean_df.head(), titles=['number 10% KNN Impute','number 10% Mean Impute'])" + ] + }, + { + "cell_type": "markdown", + "id": "72a3bc3c-0f91-49ad-bf03-dc4b7ace265d", + "metadata": {}, + "source": [ + "#### **number 1% KNN Impute VS number 1% Mean Impute**" + ] + }, + { + "cell_type": "code", + "execution_count": 138, + "id": "6fd11f89-9f4b-49b3-b114-1ab3b461f180", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
       number  number_copy_1_percent
124  0.192990               0.192926
257  0.065602               0.066172
309  0.661447               0.663769
313  0.963951               0.962988
405  0.627460               0.627545
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
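<table>\n",
+ "<thead><tr><th></th><th>number</th><th>number_copy_1_percent</th></tr></thead>\n",
+ "<tbody>\n",
+ "<tr><th>124</th><td>0.192990</td><td>0.491084</td></tr>\n",
+ "<tr><th>257</th><td>0.065602</td><td>0.491084</td></tr>\n",
+ "<tr><th>309</th><td>0.661447</td><td>0.491084</td></tr>\n",
+ "<tr><th>313</th><td>0.963951</td><td>0.491084</td></tr>\n",
+ "<tr><th>405</th><td>0.627460</td><td>0.491084</td></tr>\n",
+ "</tbody>\n",
+ "</table>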
\n", + "
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 138, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "multi_table([compare_1_df.head(), compare_1_mean_df.head()])" + ] + }, + { + "cell_type": "markdown", + "id": "e1fc9d1c-53ef-42d3-809b-d68051057e48", + "metadata": {}, + "source": [ + "#### **number 5% KNN Impute VS number 5% Mean Impute**" + ] + }, + { + "cell_type": "code", + "execution_count": 139, + "id": "a97c1530-2e50-48d2-a7e0-89fc70f648e5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
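<table>\n",
+ "<thead><tr><th></th><th>number</th><th>number_copy_5_percent</th></tr></thead>\n",
+ "<tbody>\n",
+ "<tr><th>54</th><td>0.440144</td><td>0.439307</td></tr>\n",
+ "<tr><th>59</th><td>0.189655</td><td>0.191045</td></tr>\n",
+ "<tr><th>72</th><td>0.411451</td><td>0.412386</td></tr>\n",
+ "<tr><th>78</th><td>0.205178</td><td>0.204306</td></tr>\n",
+ "<tr><th>107</th><td>0.323097</td><td>0.322044</td></tr>\n",
+ "</tbody>\n",
+ "</table>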
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
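<table>\n",
+ "<thead><tr><th></th><th>number</th><th>number_copy_5_percent</th></tr></thead>\n",
+ "<tbody>\n",
+ "<tr><th>54</th><td>0.440144</td><td>0.491932</td></tr>\n",
+ "<tr><th>59</th><td>0.189655</td><td>0.491932</td></tr>\n",
+ "<tr><th>72</th><td>0.411451</td><td>0.491932</td></tr>\n",
+ "<tr><th>78</th><td>0.205178</td><td>0.491932</td></tr>\n",
+ "<tr><th>107</th><td>0.323097</td><td>0.491932</td></tr>\n",
+ "</tbody>\n",
+ "</table>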
\n", + "
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 139, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "multi_table([compare_5_df.head(), compare_5_mean_df.head()])" + ] + }, + { + "cell_type": "markdown", + "id": "1e732ac9-faf7-4457-baef-ac9c4976598c", + "metadata": {}, + "source": [ + "#### **number 10% KNN Impute VS number 10% Mean Impute**" + ] + }, + { + "cell_type": "code", + "execution_count": 140, + "id": "f2d22e8f-5a0b-48c0-9150-a391d48e93b2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
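<table>\n",
+ "<thead><tr><th></th><th>number</th><th>number_copy_10_percent</th></tr></thead>\n",
+ "<tbody>\n",
+ "<tr><th>22</th><td>0.798188</td><td>0.798777</td></tr>\n",
+ "<tr><th>47</th><td>0.861454</td><td>0.861385</td></tr>\n",
+ "<tr><th>49</th><td>0.445108</td><td>0.446055</td></tr>\n",
+ "<tr><th>68</th><td>0.557468</td><td>0.557299</td></tr>\n",
+ "<tr><th>69</th><td>0.231172</td><td>0.230069</td></tr>\n",
+ "</tbody>\n",
+ "</table>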
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
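<table>\n",
+ "<thead><tr><th></th><th>number</th><th>number_copy_10_percent</th></tr></thead>\n",
+ "<tbody>\n",
+ "<tr><th>22</th><td>0.798188</td><td>0.487348</td></tr>\n",
+ "<tr><th>47</th><td>0.861454</td><td>0.487348</td></tr>\n",
+ "<tr><th>49</th><td>0.445108</td><td>0.487348</td></tr>\n",
+ "<tr><th>68</th><td>0.557468</td><td>0.487348</td></tr>\n",
+ "<tr><th>69</th><td>0.231172</td><td>0.487348</td></tr>\n",
+ "</tbody>\n",
+ "</table>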
\n", + "
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 140, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "multi_table([compare_10_df.head(), compare_10_mean_df.head()])" + ] + }, + { + "cell_type": "markdown", + "id": "cc817314-971f-4abf-a56e-9830a5cf0329", + "metadata": {}, + "source": [ + "# 1.2 Random Numbers dataset Results - KNN and MEAN" + ] + }, + { + "cell_type": "code", + "execution_count": 142, + "id": "1397844d-6757-471c-bd76-ff84d466b150", + "metadata": {}, + "outputs": [], + "source": [ + "results = pd.concat([df_knn_number, df_MI_number])" + ] + }, + { + "cell_type": "code", + "execution_count": 143, + "id": "51868cc7-20f3-499d-a76d-f06f99ea1841", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
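<table>\n",
+ "<thead><tr><th></th><th>diff. list Mean(KNN)</th><th>diff. list Var.(KNN)</th><th>diff. list Mean(MI)</th><th>diff. list Var.(MI)</th></tr></thead>\n",
+ "<tbody>\n",
+ "<tr><th>1%_number</th><td>0.000790</td><td>4.568702e-07</td><td>NaN</td><td>NaN</td></tr>\n",
+ "<tr><th>5%_number</th><td>0.000676</td><td>3.072444e-07</td><td>NaN</td><td>NaN</td></tr>\n",
+ "<tr><th>10%_number</th><td>0.000648</td><td>2.480609e-07</td><td>NaN</td><td>NaN</td></tr>\n",
+ "<tr><th>1%_number</th><td>NaN</td><td>NaN</td><td>0.269369</td><td>0.018130</td></tr>\n",
+ "<tr><th>5%_number</th><td>NaN</td><td>NaN</td><td>0.184841</td><td>0.014921</td></tr>\n",
+ "<tr><th>10%_number</th><td>NaN</td><td>NaN</td><td>0.231501</td><td>0.020024</td></tr>\n",
+ "</tbody>\n",
+ "</table>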
\n", + "
" + ], + "text/plain": [ + " diff. list Mean(KNN) diff. list Var.(KNN) diff. list Mean(MI) \\\n", + "1%_number 0.000790 4.568702e-07 NaN \n", + "5%_number 0.000676 3.072444e-07 NaN \n", + "10%_number 0.000648 2.480609e-07 NaN \n", + "1%_number NaN NaN 0.269369 \n", + "5%_number NaN NaN 0.184841 \n", + "10%_number NaN NaN 0.231501 \n", + "\n", + " diff. list Var.(MI) \n", + "1%_number NaN \n", + "5%_number NaN \n", + "10%_number NaN \n", + "1%_number 0.018130 \n", + "5%_number 0.014921 \n", + "10%_number 0.020024 " + ] + }, + "execution_count": 143, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "results" + ] + }, + { + "cell_type": "code", + "execution_count": 144, + "id": "85deaebb-3a2b-4b52-bf80-ce31499a70d8", + "metadata": {}, + "outputs": [], + "source": [ + "results.to_csv('random_num_knn_mean_results.csv')" + ] + }, + { + "cell_type": "markdown", + "id": "08586561-e3a5-4d15-a1c0-b8d71731a84a", + "metadata": {}, + "source": [ + "# 2.1 Housing Dataset " + ] + }, + { + "cell_type": "code", + "execution_count": 361, + "id": "c05f4dd5-4cdc-4617-939a-2e22ec859af1", + "metadata": {}, + "outputs": [], + "source": [ + "housing_data = pd.read_csv('https://raw.githubusercontent.com/nikbearbrown/AI_Research_Group/main/Awesome-UCI-Datasets/Classification/House_Price_predication/train.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": 362, + "id": "8564d163-97ce-44da-8d3c-6f8cd9c1d0a1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
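<table>\n",
+ "<thead><tr><th></th><th>Id</th><th>MSSubClass</th><th>MSZoning</th><th>LotFrontage</th><th>LotArea</th><th>Street</th><th>Alley</th><th>LotShape</th><th>LandContour</th><th>Utilities</th><th>...</th><th>PoolArea</th><th>PoolQC</th><th>Fence</th><th>MiscFeature</th><th>MiscVal</th><th>MoSold</th><th>YrSold</th><th>SaleType</th><th>SaleCondition</th><th>SalePrice</th></tr></thead>\n",
+ "<tbody>\n",
+ "<tr><th>820</th><td>821</td><td>60</td><td>RL</td><td>72.0</td><td>7226</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>6</td><td>2008</td><td>WD</td><td>Normal</td><td>183000</td></tr>\n",
+ "<tr><th>1390</th><td>1391</td><td>20</td><td>RL</td><td>70.0</td><td>9100</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>9</td><td>2006</td><td>WD</td><td>Normal</td><td>235000</td></tr>\n",
+ "<tr><th>535</th><td>536</td><td>190</td><td>RL</td><td>70.0</td><td>7000</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>1</td><td>2008</td><td>WD</td><td>Normal</td><td>107500</td></tr>\n",
+ "<tr><th>1236</th><td>1237</td><td>160</td><td>RL</td><td>36.0</td><td>2628</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>6</td><td>2010</td><td>WD</td><td>Normal</td><td>175500</td></tr>\n",
+ "<tr><th>1337</th><td>1338</td><td>30</td><td>RM</td><td>153.0</td><td>4118</td><td>Pave</td><td>Grvl</td><td>IR1</td><td>Bnk</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>3</td><td>2006</td><td>WD</td><td>Normal</td><td>52500</td></tr>\n",
+ "<tr><th>674</th><td>675</td><td>20</td><td>RL</td><td>80.0</td><td>9200</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>7</td><td>2008</td><td>WD</td><td>Normal</td><td>140000</td></tr>\n",
+ "<tr><th>604</th><td>605</td><td>20</td><td>RL</td><td>88.0</td><td>12803</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>9</td><td>2008</td><td>WD</td><td>Normal</td><td>221000</td></tr>\n",
+ "<tr><th>605</th><td>606</td><td>60</td><td>RL</td><td>85.0</td><td>13600</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>10</td><td>2009</td><td>WD</td><td>Normal</td><td>205000</td></tr>\n",
+ "<tr><th>1218</th><td>1219</td><td>50</td><td>RM</td><td>52.0</td><td>6240</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>7</td><td>2006</td><td>WD</td><td>Normal</td><td>80500</td></tr>\n",
+ "<tr><th>882</th><td>883</td><td>60</td><td>RL</td><td>NaN</td><td>9636</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>MnPrv</td><td>NaN</td><td>0</td><td>12</td><td>2009</td><td>WD</td><td>Normal</td><td>178000</td></tr>\n",
+ "</tbody>\n",
+ "</table>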
\n", + "

10 rows × 81 columns

\n", + "
" + ], + "text/plain": [ + " Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", + "820 821 60 RL 72.0 7226 Pave NaN IR1 \n", + "1390 1391 20 RL 70.0 9100 Pave NaN Reg \n", + "535 536 190 RL 70.0 7000 Pave NaN Reg \n", + "1236 1237 160 RL 36.0 2628 Pave NaN Reg \n", + "1337 1338 30 RM 153.0 4118 Pave Grvl IR1 \n", + "674 675 20 RL 80.0 9200 Pave NaN Reg \n", + "604 605 20 RL 88.0 12803 Pave NaN IR1 \n", + "605 606 60 RL 85.0 13600 Pave NaN Reg \n", + "1218 1219 50 RM 52.0 6240 Pave NaN Reg \n", + "882 883 60 RL NaN 9636 Pave NaN IR1 \n", + "\n", + " LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal \\\n", + "820 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "1390 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "535 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "1236 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "1337 Bnk AllPub ... 0 NaN NaN NaN 0 \n", + "674 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "604 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "605 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "1218 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "882 Lvl AllPub ... 
0 NaN MnPrv NaN 0 \n", + "\n", + " MoSold YrSold SaleType SaleCondition SalePrice \n", + "820 6 2008 WD Normal 183000 \n", + "1390 9 2006 WD Normal 235000 \n", + "535 1 2008 WD Normal 107500 \n", + "1236 6 2010 WD Normal 175500 \n", + "1337 3 2006 WD Normal 52500 \n", + "674 7 2008 WD Normal 140000 \n", + "604 9 2008 WD Normal 221000 \n", + "605 10 2009 WD Normal 205000 \n", + "1218 7 2006 WD Normal 80500 \n", + "882 12 2009 WD Normal 178000 \n", + "\n", + "[10 rows x 81 columns]" + ] + }, + "execution_count": 362, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data.sample(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 363, + "id": "bd81975c-0a21-414b-8e20-3564d35b9f9b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "663" + ] + }, + "execution_count": 363, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data['SalePrice'].nunique()" + ] + }, + { + "cell_type": "code", + "execution_count": 364, + "id": "67d1046e-a1ad-412e-a7e8-a0d51729cec7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1073" + ] + }, + "execution_count": 364, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data['LotArea'].nunique()" + ] + }, + { + "cell_type": "code", + "execution_count": 365, + "id": "64b05e52-72dc-4f7d-aca3-d043036b4d2f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "count 1460.000000\n", + "mean 180921.195890\n", + "std 79442.502883\n", + "min 34900.000000\n", + "25% 129975.000000\n", + "50% 163000.000000\n", + "75% 214000.000000\n", + "max 755000.000000\n", + "Name: SalePrice, dtype: float64" + ] + }, + "execution_count": 365, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data['SalePrice'].describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 366, + "id": "b7e9928c-4785-4ee1-8150-cd0fa1ef3325", + "metadata": {}, + "outputs": [ 
+ { + "data": { + "text/plain": [ + "count 1460.000000\n", + "mean 10516.828082\n", + "std 9981.264932\n", + "min 1300.000000\n", + "25% 7553.500000\n", + "50% 9478.500000\n", + "75% 11601.500000\n", + "max 215245.000000\n", + "Name: LotArea, dtype: float64" + ] + }, + "execution_count": 366, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data['LotArea'].describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 367, + "id": "20149f80-07dc-4eaa-8d0e-7de6612a7dce", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "Id Id 0.000000\n", + "MSSubClass MSSubClass 0.000000\n", + "MSZoning MSZoning 0.000000\n", + "LotFrontage LotFrontage 17.739726\n", + "LotArea LotArea 0.000000\n", + "Street Street 0.000000\n", + "Alley Alley 93.767123\n", + "LotShape LotShape 0.000000\n", + "LandContour LandContour 0.000000\n", + "Utilities Utilities 0.000000\n", + "LotConfig LotConfig 0.000000\n", + "LandSlope LandSlope 0.000000\n", + "Neighborhood Neighborhood 0.000000\n", + "Condition1 Condition1 0.000000\n", + "Condition2 Condition2 0.000000\n", + "BldgType BldgType 0.000000\n", + "HouseStyle HouseStyle 0.000000\n", + "OverallQual OverallQual 0.000000\n", + "OverallCond OverallCond 0.000000\n", + "YearBuilt YearBuilt 0.000000\n", + "YearRemodAdd YearRemodAdd 0.000000\n", + "RoofStyle RoofStyle 0.000000\n", + "RoofMatl RoofMatl 0.000000\n", + "Exterior1st Exterior1st 0.000000\n", + "Exterior2nd Exterior2nd 0.000000\n", + "MasVnrType MasVnrType 0.547945\n", + "MasVnrArea MasVnrArea 0.547945\n", + "ExterQual ExterQual 0.000000\n", + "ExterCond ExterCond 0.000000\n", + "Foundation Foundation 0.000000\n", + "BsmtQual BsmtQual 2.534247\n", + "BsmtCond BsmtCond 2.534247\n", + "BsmtExposure BsmtExposure 2.602740\n", + "BsmtFinType1 BsmtFinType1 2.534247\n", + "BsmtFinSF1 BsmtFinSF1 0.000000\n", + "BsmtFinType2 BsmtFinType2 2.602740\n", + "BsmtFinSF2 BsmtFinSF2 
0.000000\n", + "BsmtUnfSF BsmtUnfSF 0.000000\n", + "TotalBsmtSF TotalBsmtSF 0.000000\n", + "Heating Heating 0.000000\n", + "HeatingQC HeatingQC 0.000000\n", + "CentralAir CentralAir 0.000000\n", + "Electrical Electrical 0.068493\n", + "1stFlrSF 1stFlrSF 0.000000\n", + "2ndFlrSF 2ndFlrSF 0.000000\n", + "LowQualFinSF LowQualFinSF 0.000000\n", + "GrLivArea GrLivArea 0.000000\n", + "BsmtFullBath BsmtFullBath 0.000000\n", + "BsmtHalfBath BsmtHalfBath 0.000000\n", + "FullBath FullBath 0.000000\n", + "HalfBath HalfBath 0.000000\n", + "BedroomAbvGr BedroomAbvGr 0.000000\n", + "KitchenAbvGr KitchenAbvGr 0.000000\n", + "KitchenQual KitchenQual 0.000000\n", + "TotRmsAbvGrd TotRmsAbvGrd 0.000000\n", + "Functional Functional 0.000000\n", + "Fireplaces Fireplaces 0.000000\n", + "FireplaceQu FireplaceQu 47.260274\n", + "GarageType GarageType 5.547945\n", + "GarageYrBlt GarageYrBlt 5.547945\n", + "GarageFinish GarageFinish 5.547945\n", + "GarageCars GarageCars 0.000000\n", + "GarageArea GarageArea 0.000000\n", + "GarageQual GarageQual 5.547945\n", + "GarageCond GarageCond 5.547945\n", + "PavedDrive PavedDrive 0.000000\n", + "WoodDeckSF WoodDeckSF 0.000000\n", + "OpenPorchSF OpenPorchSF 0.000000\n", + "EnclosedPorch EnclosedPorch 0.000000\n", + "3SsnPorch 3SsnPorch 0.000000\n", + "ScreenPorch ScreenPorch 0.000000\n", + "PoolArea PoolArea 0.000000\n", + "PoolQC PoolQC 99.520548\n", + "Fence Fence 80.753425\n", + "MiscFeature MiscFeature 96.301370\n", + "MiscVal MiscVal 0.000000\n", + "MoSold MoSold 0.000000\n", + "YrSold YrSold 0.000000\n", + "SaleType SaleType 0.000000\n", + "SaleCondition SaleCondition 0.000000\n", + "SalePrice SalePrice 0.000000\n" + ] + } + ], + "source": [ + "pd.set_option('display.max_rows', None)\n", + "print(get_percent_missing(housing_data))" + ] + }, + { + "cell_type": "markdown", + "id": "c8eb3ee3-085d-4b41-9a5f-c83a3805f870", + "metadata": {}, + "source": [ + "#### Using the SalePrice column for the KNN and mean imputation task" + ] + }, + { + "cell_type": 
"markdown", + "id": "451c79fb-17ba-40ac-8f0b-87a8b2ec4837", + "metadata": {}, + "source": [ + "#### Non Scaled dataframe Sale Price - take first 1000 rows" + ] + }, + { + "cell_type": "code", + "execution_count": 368, + "id": "9cc1f97f-1b24-4570-8f6a-30426bd79269", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
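<table>\n",
+ "<thead><tr><th></th><th>SalePrice</th><th>sp_copy_1_percent</th><th>sp_copy_5_percent</th><th>sp_copy_10_percent</th></tr></thead>\n",
+ "<tbody>\n",
+ "<tr><th>0</th><td>208500</td><td>208500</td><td>208500</td><td>208500</td></tr>\n",
+ "<tr><th>1</th><td>181500</td><td>181500</td><td>181500</td><td>181500</td></tr>\n",
+ "<tr><th>2</th><td>223500</td><td>223500</td><td>223500</td><td>223500</td></tr>\n",
+ "<tr><th>3</th><td>140000</td><td>140000</td><td>140000</td><td>140000</td></tr>\n",
+ "<tr><th>4</th><td>250000</td><td>250000</td><td>250000</td><td>250000</td></tr>\n",
+ "</tbody>\n",
+ "</table>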
 \n", + "
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "0 208500 208500 208500 208500\n", + "1 181500 181500 181500 181500\n", + "2 223500 223500 223500 223500\n", + "3 140000 140000 140000 140000\n", + "4 250000 250000 250000 250000" + ] + }, + "execution_count": 368, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Work on an explicit copy of the first 1000 rows so later edits do not touch housing_data\n", + "df_saleprice = housing_data[['SalePrice']].iloc[:1000].copy()\n", + "df_saleprice['sp_copy_1_percent'] = df_saleprice['SalePrice']\n", + "df_saleprice['sp_copy_5_percent'] = df_saleprice['SalePrice']\n", + "df_saleprice['sp_copy_10_percent'] = df_saleprice['SalePrice']\n", + "df_saleprice.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 369, + "id": "f462f065-9f37-44f1-a22e-92e610dae2e9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1000" + ] + }, + "execution_count": 369, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(df_saleprice)" + ] + }, + { + "cell_type": "markdown", + "id": "03407bbd-f8a7-4f6c-a7c3-64a865ed3f7e", + "metadata": {}, + "source": [ + "#### Scaled dataframe SalePrice - first 1000 rows" + ] + }, + { + "cell_type": "code", + "execution_count": 370, + "id": "e461b1ef-df2c-410f-aea8-abe954fa9afd", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
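<table>\n",
+ "<thead><tr><th></th><th>SalePrice</th><th>sp_copy_1_percent</th><th>sp_copy_5_percent</th><th>sp_copy_10_percent</th></tr></thead>\n",
+ "<tbody>\n",
+ "<tr><th>0</th><td>0.241078</td><td>0.241078</td><td>0.241078</td><td>0.241078</td></tr>\n",
+ "<tr><th>1</th><td>0.203583</td><td>0.203583</td><td>0.203583</td><td>0.203583</td></tr>\n",
+ "<tr><th>2</th><td>0.261908</td><td>0.261908</td><td>0.261908</td><td>0.261908</td></tr>\n",
+ "<tr><th>3</th><td>0.145952</td><td>0.145952</td><td>0.145952</td><td>0.145952</td></tr>\n",
+ "<tr><th>4</th><td>0.298709</td><td>0.298709</td><td>0.298709</td><td>0.298709</td></tr>\n",
+ "</tbody>\n",
+ "</table>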
\n", + "
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "0 0.241078 0.241078 0.241078 0.241078\n", + "1 0.203583 0.203583 0.203583 0.203583\n", + "2 0.261908 0.261908 0.261908 0.261908\n", + "3 0.145952 0.145952 0.145952 0.145952\n", + "4 0.298709 0.298709 0.298709 0.298709" + ] + }, + "execution_count": 370, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "scaler = MinMaxScaler()\n", + "df_saleprice_scaled = df_saleprice.copy(deep=True)\n", + "df_saleprice_scaled = pd.DataFrame(scaler.fit_transform(df_saleprice_scaled), columns = df_saleprice_scaled.columns)\n", + "df_saleprice_scaled.head()" + ] + }, + { + "cell_type": "markdown", + "id": "a66683c4-f66a-4aa1-ab8a-f28087b60b6c", + "metadata": {}, + "source": [ + "#### Check % missing values in this dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 371, + "id": "0075fa0f-4b82-4089-ab81-e5282497c4a3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice))" + ] + }, + { + "cell_type": "markdown", + "id": "619ef99f-55c0-422c-aaa8-73cd71fcf2fb", + "metadata": {}, + "source": [ + "#### Create 1%, 5% and 10% missing data" + ] + }, + { + "cell_type": "code", + "execution_count": 372, + "id": "82df5098-4176-4fba-922f-ca84c0466f2a", + "metadata": {}, + "outputs": [], + "source": [ + "create_missing(df_saleprice, 0.01, 'sp_copy_1_percent')\n", + "create_missing(df_saleprice, 0.05, 'sp_copy_5_percent')\n", + "create_missing(df_saleprice, 0.1, 'sp_copy_10_percent')" + ] + }, + { + "cell_type": "code", + "execution_count": 373, + "id": "0e90ae04-cd10-4507-a851-c187010f0be0", + "metadata": {}, + "outputs": [], + "source": [ 
+ "create_missing(df_saleprice_scaled, 0.01, 'sp_copy_1_percent')\n", + "create_missing(df_saleprice_scaled, 0.05, 'sp_copy_5_percent')\n", + "create_missing(df_saleprice_scaled, 0.1, 'sp_copy_10_percent')" + ] + }, + { + "cell_type": "markdown", + "id": "a8237a82-5a33-4ce9-b4c7-a48ede4f5fef", + "metadata": {}, + "source": [ + "#### Check missing values in the unscaled and scaled dataframes" + ] + }, + { + "cell_type": "code", + "execution_count": 374, + "id": "2794306d-89c7-4518-8979-9edb3d9441b1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 1.0\n", + "sp_copy_5_percent sp_copy_5_percent 5.0\n", + "sp_copy_10_percent sp_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice))" + ] + }, + { + "cell_type": "code", + "execution_count": 375, + "id": "8351dbe2-b388-451d-9238-52c4ccabd425", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 1.0\n", + "sp_copy_5_percent sp_copy_5_percent 5.0\n", + "sp_copy_10_percent sp_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice_scaled))" + ] + }, + { + "cell_type": "code", + "execution_count": 376, + "id": "b11b093f-110b-4ef3-9d00-ac4fed45a956", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "10" + ] + }, + "execution_count": 376, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_saleprice['sp_copy_1_percent'].isna().sum()" + ] + }, + { + "cell_type": "markdown", + "id": "360e0010-e085-435c-8902-80c6a7ea78be", + "metadata": {}, + "source": [ + "#### Store indices of missing values" + ] + }, + { + "cell_type": "code", + "execution_count": 377, + "id": "e546096c-ce35-448e-aa97-0943d3535a87", + 
"metadata": {}, + "outputs": [], + "source": [ + "# Store the positional indices of the NaN values in each column\n", + "sp_1_idx = list(np.where(df_saleprice['sp_copy_1_percent'].isna())[0])\n", + "sp_5_idx = list(np.where(df_saleprice['sp_copy_5_percent'].isna())[0])\n", + "sp_10_idx = list(np.where(df_saleprice['sp_copy_10_percent'].isna())[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 378, + "id": "d409e2a5-b3a9-4ae1-9b17-88b7c642692d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_1_idx))\n", + "print(len(sp_5_idx))\n", + "print(len(sp_10_idx))" + ] + }, + { + "cell_type": "code", + "execution_count": 379, + "id": "5839460a-e736-42e9-9a13-d5bab5683115", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Length of sp_1_idx is 10 and it contains 1.0% of total data in column | Total rows: 1000\n", + "Length of sp_5_idx is 50 and it contains 5.0% of total data in column | Total rows: 1000\n", + "Length of sp_10_idx is 100 and it contains 10.0% of total data in column | Total rows: 1000\n" + ] + } + ], + "source": [ + "print(f\"Length of sp_1_idx is {len(sp_1_idx)} and it contains {(len(sp_1_idx)/len(df_saleprice['sp_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_1_percent'])}\")\n", + "print(f\"Length of sp_5_idx is {len(sp_5_idx)} and it contains {(len(sp_5_idx)/len(df_saleprice['sp_copy_5_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_5_percent'])}\")\n", + "print(f\"Length of sp_10_idx is {len(sp_10_idx)} and it contains {(len(sp_10_idx)/len(df_saleprice['sp_copy_10_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_10_percent'])}\")" + ] + }, + { + "cell_type": "markdown", + "id": "c1464c79-c0a9-4640-92dd-f0d5131634ab", + "metadata": {}, + "source": [ + "### Apply KNN imputation to 
df_saleprice 
and df_saleprice_scaled dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 380, + "id": "08fa2436-ffb8-4b5d-a7a1-9e2d63b14562", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice1 = df_saleprice.copy(deep=True)\n", + "imputer = KNNImputer(n_neighbors=5)\n", + "imputed_saleprice_df = pd.DataFrame(imputer.fit_transform(df_saleprice1), columns = df_saleprice1.columns)" + ] + }, + { + "cell_type": "code", + "execution_count": 381, + "id": "205c7a96-3f1c-42a4-91de-f22f15ce9cb2", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice_scaled1 = df_saleprice_scaled.copy(deep=True)\n", + "imputer = KNNImputer(n_neighbors=5)\n", + "imputed_saleprice_scaled_df = pd.DataFrame(imputer.fit_transform(df_saleprice_scaled1), columns = df_saleprice_scaled1.columns)" + ] + }, + { + "cell_type": "code", + "execution_count": 382, + "id": "a482f58d-73b6-423c-b97a-140884830a0f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
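<table>\n",
+ "<thead><tr><th></th><th>SalePrice</th><th>sp_copy_1_percent</th><th>sp_copy_5_percent</th><th>sp_copy_10_percent</th></tr></thead>\n",
+ "<tbody>\n",
+ "<tr><th>0</th><td>208500.0</td><td>208500.0</td><td>208500.0</td><td>208500.0</td></tr>\n",
+ "<tr><th>1</th><td>181500.0</td><td>181500.0</td><td>181500.0</td><td>181500.0</td></tr>\n",
+ "<tr><th>2</th><td>223500.0</td><td>223500.0</td><td>223500.0</td><td>223500.0</td></tr>\n",
+ "<tr><th>3</th><td>140000.0</td><td>140000.0</td><td>140000.0</td><td>140000.0</td></tr>\n",
+ "<tr><th>4</th><td>250000.0</td><td>250000.0</td><td>250000.0</td><td>250000.0</td></tr>\n",
+ "</tbody>\n",
+ "</table>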
\n", + "
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "0 208500.0 208500.0 208500.0 208500.0\n", + "1 181500.0 181500.0 181500.0 181500.0\n", + "2 223500.0 223500.0 223500.0 223500.0\n", + "3 140000.0 140000.0 140000.0 140000.0\n", + "4 250000.0 250000.0 250000.0 250000.0" + ] + }, + "execution_count": 382, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "imputed_saleprice_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 383, + "id": "11f8f5ff-f06d-4ec2-a4e3-1324e807a537", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
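<table>\n",
+ "<thead><tr><th></th><th>SalePrice</th><th>sp_copy_1_percent</th><th>sp_copy_5_percent</th><th>sp_copy_10_percent</th></tr></thead>\n",
+ "<tbody>\n",
+ "<tr><th>0</th><td>0.241078</td><td>0.241078</td><td>0.240855</td><td>0.241078</td></tr>\n",
+ "<tr><th>1</th><td>0.203583</td><td>0.203583</td><td>0.203583</td><td>0.203583</td></tr>\n",
+ "<tr><th>2</th><td>0.261908</td><td>0.261908</td><td>0.261908</td><td>0.261908</td></tr>\n",
+ "<tr><th>3</th><td>0.145952</td><td>0.145952</td><td>0.145952</td><td>0.145952</td></tr>\n",
+ "<tr><th>4</th><td>0.298709</td><td>0.298709</td><td>0.298709</td><td>0.298709</td></tr>\n",
+ "</tbody>\n",
+ "</table>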
 \n", + "
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "0 0.241078 0.241078 0.240855 0.241078\n", + "1 0.203583 0.203583 0.203583 0.203583\n", + "2 0.261908 0.261908 0.261908 0.261908\n", + "3 0.145952 0.145952 0.145952 0.145952\n", + "4 0.298709 0.298709 0.298709 0.298709" + ] + }, + "execution_count": 383, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "imputed_saleprice_scaled_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "d9fd7fa1-4ce0-43be-9955-55ef759d930b", + "metadata": {}, + "source": [ + "#### Check % missing in saleprice and saleprice_scaled DF" + ] + }, + { + "cell_type": "code", + "execution_count": 384, + "id": "9ed0d36a-9584-4e3b-9201-2ac36827bce9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(imputed_saleprice_df))" + ] + }, + { + "cell_type": "code", + "execution_count": 385, + "id": "7c842fce-bbd5-4c2c-bb1a-db5df92f6315", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(imputed_saleprice_scaled_df))" + ] + }, + { + "cell_type": "markdown", + "id": "ac47abb1-df5f-4686-bc67-6617140c008c", + "metadata": {}, + "source": [ + "#### Store the list of differences between original 
and Imputed Value" + ] + }, + { + "cell_type": "code", + "execution_count": 386, + "id": "99e04554-568d-4efa-a110-768b50dfaee6", + "metadata": {}, + "outputs": [], + "source": [ + "# Absolute differences between imputed and original values (unscaled data)\n", + "\n", + "sp_diff_1 = []\n", + "sp_diff_5 = []\n", + "sp_diff_10 = []\n", + "\n", + "for i in sp_1_idx:\n", + "    diff1 = abs(imputed_saleprice_df['sp_copy_1_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", + "    sp_diff_1.append(diff1)\n", + "\n", + "for i in sp_5_idx:\n", + "    diff5 = abs(imputed_saleprice_df['sp_copy_5_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", + "    sp_diff_5.append(diff5)\n", + "\n", + "for i in sp_10_idx:\n", + "    diff10 = abs(imputed_saleprice_df['sp_copy_10_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", + "    sp_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 387, + "id": "92204f8a-497c-470d-a770-59165d226cc9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_diff_1))\n", + "print(len(sp_diff_5))\n", + "print(len(sp_diff_10))" + ] + }, + { + "cell_type": "code", + "execution_count": 388, + "id": "b8875fff-0289-4dd9-92c1-78dc9b730d22", + "metadata": {}, + "outputs": [], + "source": [ + "# Absolute differences between imputed and original values (scaled data).\n", + "# df_saleprice_scaled got its missing values from separate create_missing calls,\n", + "# so its NaN positions are looked up in that frame instead of reusing sp_*_idx.\n", + "sp_scaled_1_idx = list(np.where(df_saleprice_scaled['sp_copy_1_percent'].isna())[0])\n", + "sp_scaled_5_idx = list(np.where(df_saleprice_scaled['sp_copy_5_percent'].isna())[0])\n", + "sp_scaled_10_idx = list(np.where(df_saleprice_scaled['sp_copy_10_percent'].isna())[0])\n", + "\n", + "sp_scaled_diff_1 = []\n", + "sp_scaled_diff_5 = []\n", + "sp_scaled_diff_10 = []\n", + "\n", + 
"for i in sp_scaled_1_idx:\n", + "    diff1 = abs(imputed_saleprice_scaled_df['sp_copy_1_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", + "    sp_scaled_diff_1.append(diff1)\n", + "\n", + "for i in sp_scaled_5_idx:\n", + "    diff5 = abs(imputed_saleprice_scaled_df['sp_copy_5_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", + "    sp_scaled_diff_5.append(diff5)\n", + "\n", + "for i in sp_scaled_10_idx:\n", + "    diff10 = abs(imputed_saleprice_scaled_df['sp_copy_10_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", + "    sp_scaled_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 389, + "id": "40192344-79a4-444c-a12a-2201dc5aa0c1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_scaled_diff_1))\n", + "print(len(sp_scaled_diff_5))\n", + "print(len(sp_scaled_diff_10))" + ] + }, + { + "cell_type": "code", + "execution_count": 390, + "id": "a95bd45c-8a2f-4159-8306-399ec18a4c0f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[0.0, 0.0, 0.0, 0.0, 0.0]" + ] + }, + "execution_count": 390, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sp_scaled_diff_1[:5]" + ] + }, + { + "cell_type": "code", + "execution_count": 391, + "id": "0f73d420-8842-4062-ae17-158a0a25e169", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[10.0, 20.0, 80.0, 220.0, 0.0]" + ] + }, + "execution_count": 391, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sp_diff_1[:5]" + ] + }, + { + "cell_type": "markdown", + "id": "a40fd400-913b-4011-b0b9-dd3ca0d5827a", + "metadata": {}, + "source": [ + "#### Calculate the mean and var of list of diff. 
KNN - SalePrice" + ] + }, + { + "cell_type": "code", + "execution_count": 392, + "id": "80267827-7f73-49ff-b200-27cdb2963756", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 170.0 and variance 1% is 42400.0\n", + "The mean of 5% is 444.9439999999997 and variance 5% is 2554554.1584639903\n", + "The mean of 10% is 564.784 and variance 10% is 6304766.8341439795\n" + ] + } + ], + "source": [ + "m1 = sum(sp_diff_1) / len(sp_diff_1)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res1 = sum((xi - m1) ** 2 for xi in sp_diff_1) / len(sp_diff_1)\n", + "\n", + "m5 = sum(sp_diff_5) / len(sp_diff_5)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res5 = sum((xii - m5) ** 2 for xii in sp_diff_5) / len(sp_diff_5)\n", + "\n", + "\n", + "m10 = sum(sp_diff_10) / len(sp_diff_10)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_diff_10) / len(sp_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 393, + "id": "358545ff-2fcf-4c99-9049-4eaf6dd110bd", + "metadata": {}, + "outputs": [], + "source": [ + "df_knn_saleprice = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", + " '5%_saleprice': [m5, var_res5],\n", + " '10%_saleprice': [m10, var_res10]}, orient='index')\n", + "df_knn_saleprice.columns=['diff. list Mean(KNN)', 'diff. list Var.(KNN)']" + ] + }, + { + "cell_type": "code", + "execution_count": 394, + "id": "3714c8f9-58db-40a7-b5a2-6bb7e788b734", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
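<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n",
+ " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>diff. list Mean(KNN)</th>\n", + " <th>diff. list Var.(KNN)</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n",
+ " <tr>\n", + " <th>1%_saleprice</th>\n", + " <td>170.000</td>\n", + " <td>4.240000e+04</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>5%_saleprice</th>\n", + " <td>444.944</td>\n", + " <td>2.554554e+06</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>10%_saleprice</th>\n", + " <td>564.784</td>\n", + " <td>6.304767e+06</td>\n", + " </tr>\n",
+ " </tbody>\n", + "</table>\n", + "</div>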
" + ], + "text/plain": [ + " diff. list Mean(KNN) diff. list Var.(KNN)\n", + "1%_saleprice 170.000 4.240000e+04\n", + "5%_saleprice 444.944 2.554554e+06\n", + "10%_saleprice 564.784 6.304767e+06" + ] + }, + "execution_count": 394, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_knn_saleprice" + ] + }, + { + "cell_type": "markdown", + "id": "fd7608a8-c5fb-425c-a340-af01801ee349", + "metadata": {}, + "source": [ + "#### Calculate the mean and var of list of diff. KNN - SalePrice scaled" + ] + }, + { + "cell_type": "code", + "execution_count": 395, + "id": "bb03017f-3d91-48d9-8ebf-7cb5c25fadc3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 0.0 and varience 1% is 0.0\n", + "The mean of 5% is 2.6301902513541363e-05 and varience 5% is 2.134349753649814e-08\n", + "The mean of 10% is 2.6301902513541363e-05 and varience 10% is 1.417383473391258e-08\n" + ] + } + ], + "source": [ + "m1 = sum(sp_scaled_diff_1) / len(sp_scaled_diff_1)\n", + "\n", + "# calculate variance using a list comprehension\n", + "var_res1 = sum((xi - m1) ** 2 for xi in sp_scaled_diff_1) / len(sp_scaled_diff_1)\n", + "\n", + "m5 = sum(sp_scaled_diff_5) / len(sp_scaled_diff_5)\n", + "\n", + "# calculate variance using a list comprehension\n", + "var_res5 = sum((xii - m5) ** 2 for xii in sp_scaled_diff_5) / len(sp_scaled_diff_5)\n", + "\n", + "\n", + "m10 = sum(sp_scaled_diff_10) / len(sp_scaled_diff_10)\n", + "\n", + "# calculate variance using a list comprehension\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_scaled_diff_10) / len(sp_scaled_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and varience 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and varience 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m5} and varience 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 396, + "id": "290d8db2-c9f4-4028-ab44-ad68c9e7b3c5", + 
"metadata": {}, + "outputs": [], + "source": [ + "df_knn_saleprice_scaled = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", + " '5%_saleprice': [m5, var_res5],\n", + " '10%_saleprice': [m10, var_res10]}, orient='index')\n", + "df_knn_saleprice_scaled.columns=['diff. list Mean(KNN) scaled', 'diff. list Var.(KNN) scaled']" + ] + }, + { + "cell_type": "code", + "execution_count": 397, + "id": "89347fd7-d87d-42bb-b375-a75417c395de", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
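<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n",
+ " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>diff. list Mean(KNN) scaled</th>\n", + " <th>diff. list Var.(KNN) scaled</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n",
+ " <tr>\n", + " <th>1%_saleprice</th>\n", + " <td>0.000000</td>\n", + " <td>0.000000e+00</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>5%_saleprice</th>\n", + " <td>0.000026</td>\n", + " <td>2.134350e-08</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>10%_saleprice</th>\n", + " <td>0.000032</td>\n", + " <td>1.417383e-08</td>\n", + " </tr>\n",
+ " </tbody>\n", + "</table>\n", + "</div>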
" + ], + "text/plain": [ + " diff. list Mean(KNN) scaled diff. list Var.(KNN) scaled\n", + "1%_saleprice 0.000000 0.000000e+00\n", + "5%_saleprice 0.000026 2.134350e-08\n", + "10%_saleprice 0.000032 1.417383e-08" + ] + }, + "execution_count": 397, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_knn_saleprice_scaled" + ] + }, + { + "cell_type": "markdown", + "id": "c984dc69-f85f-4f1b-8c94-4afb48c1c8db", + "metadata": {}, + "source": [ + "### Perform MEAN imputation" + ] + }, + { + "cell_type": "code", + "execution_count": 398, + "id": "008bc14f-45e7-42d8-b843-2fee7bcf26c2", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice2 = df_saleprice.copy(deep=True)\n", + "df_saleprice_scaled2 = df_saleprice_scaled.copy(deep=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 399, + "id": "bd71dc1a-f137-46ed-bf2b-f3d87fd4b6a0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 1.0\n", + "sp_copy_5_percent sp_copy_5_percent 5.0\n", + "sp_copy_10_percent sp_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice2))" + ] + }, + { + "cell_type": "code", + "execution_count": 400, + "id": "46237cfd-6361-466f-b66f-32f5940149d6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 1.0\n", + "sp_copy_5_percent sp_copy_5_percent 5.0\n", + "sp_copy_10_percent sp_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice_scaled2))" + ] + }, + { + "cell_type": "markdown", + "id": "64465299-5620-47b9-a28d-afb5494f279e", + "metadata": {}, + "source": [ + "#### Impute Mean values in missing for saleprice and saleprice_scaled" + ] + }, + { + "cell_type": "code", + 
"execution_count": 401, + "id": "28cf6b75-eebf-4758-94ec-4b3536f2c659", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice2['sp_copy_1_percent'] = df_saleprice2['sp_copy_1_percent'].fillna(df_saleprice2['sp_copy_1_percent'].mean())\n", + "df_saleprice2['sp_copy_5_percent'] = df_saleprice2['sp_copy_5_percent'].fillna(df_saleprice2['sp_copy_5_percent'].mean())\n", + "df_saleprice2['sp_copy_10_percent'] = df_saleprice2['sp_copy_10_percent'].fillna(df_saleprice2['sp_copy_10_percent'].mean())" + ] + }, + { + "cell_type": "code", + "execution_count": 402, + "id": "2409dd8c-3cd0-4742-b0ac-14dea1fdb504", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice_scaled2['sp_copy_1_percent'] = df_saleprice_scaled2['sp_copy_1_percent'].fillna(df_saleprice_scaled2['sp_copy_1_percent'].mean())\n", + "df_saleprice_scaled2['sp_copy_5_percent'] = df_saleprice_scaled2['sp_copy_5_percent'].fillna(df_saleprice_scaled2['sp_copy_5_percent'].mean())\n", + "df_saleprice_scaled2['sp_copy_10_percent'] = df_saleprice_scaled2['sp_copy_10_percent'].fillna(df_saleprice_scaled2['sp_copy_10_percent'].mean())" + ] + }, + { + "cell_type": "markdown", + "id": "62377754-b682-45e5-8faa-1a4a186bd3c7", + "metadata": {}, + "source": [ + "#### After MEAN imputation - Saleprice and saleprice scaled" + ] + }, + { + "cell_type": "code", + "execution_count": 403, + "id": "6c448556-55f4-4685-aed2-6b67d5ad8a2a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice2))" + ] + }, + { + "cell_type": "code", + "execution_count": 404, + "id": "d9775fbf-7a72-4352-b446-488e9d25b6a2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 
column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice_scaled2))" + ] + }, + { + "cell_type": "code", + "execution_count": 407, + "id": "136f87e6-a4af-4229-b36a-695f712deee5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
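<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n",
+ " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>SalePrice</th>\n", + " <th>sp_copy_1_percent</th>\n", + " <th>sp_copy_5_percent</th>\n", + " <th>sp_copy_10_percent</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n",
+ " <tr>\n", + " <th>571</th>\n", + " <td>120000</td>\n", + " <td>120000.0</td>\n", + " <td>120000.000000</td>\n", + " <td>182343.817778</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>2</th>\n", + " <td>223500</td>\n", + " <td>223500.0</td>\n", + " <td>223500.000000</td>\n", + " <td>223500.000000</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>313</th>\n", + " <td>375000</td>\n", + " <td>375000.0</td>\n", + " <td>375000.000000</td>\n", + " <td>375000.000000</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>377</th>\n", + " <td>340000</td>\n", + " <td>340000.0</td>\n", + " <td>182457.342105</td>\n", + " <td>182343.817778</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>987</th>\n", + " <td>395192</td>\n", + " <td>395192.0</td>\n", + " <td>395192.000000</td>\n", + " <td>395192.000000</td>\n", + " </tr>\n",
+ " </tbody>\n", + "</table>\n", + "</div>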
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "571 120000 120000.0 120000.000000 182343.817778\n", + "2 223500 223500.0 223500.000000 223500.000000\n", + "313 375000 375000.0 375000.000000 375000.000000\n", + "377 340000 340000.0 182457.342105 182343.817778\n", + "987 395192 395192.0 395192.000000 395192.000000" + ] + }, + "execution_count": 407, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_saleprice2.sample(5)" + ] + }, + { + "cell_type": "code", + "execution_count": 409, + "id": "784cb61c-78f8-4b31-b709-379c50024dca", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
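<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n",
+ " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>SalePrice</th>\n", + " <th>sp_copy_1_percent</th>\n", + " <th>sp_copy_5_percent</th>\n", + " <th>sp_copy_10_percent</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n",
+ " <tr>\n", + " <th>216</th>\n", + " <td>0.243161</td>\n", + " <td>0.243161</td>\n", + " <td>0.243161</td>\n", + " <td>0.243161</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>1</th>\n", + " <td>0.203583</td>\n", + " <td>0.203583</td>\n", + " <td>0.203583</td>\n", + " <td>0.203583</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>575</th>\n", + " <td>0.116095</td>\n", + " <td>0.116095</td>\n", + " <td>0.116095</td>\n", + " <td>0.116095</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>397</th>\n", + " <td>0.186918</td>\n", + " <td>0.186918</td>\n", + " <td>0.186918</td>\n", + " <td>0.205253</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>703</th>\n", + " <td>0.145952</td>\n", + " <td>0.145952</td>\n", + " <td>0.145952</td>\n", + " <td>0.145952</td>\n", + " </tr>\n",
+ " </tbody>\n", + "</table>\n", + "</div>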
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "216 0.243161 0.243161 0.243161 0.243161\n", + "1 0.203583 0.203583 0.203583 0.203583\n", + "575 0.116095 0.116095 0.116095 0.116095\n", + "397 0.186918 0.186918 0.186918 0.205253\n", + "703 0.145952 0.145952 0.145952 0.145952" + ] + }, + "execution_count": 409, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_saleprice_scaled2.sample(5)" + ] + }, + { + "cell_type": "markdown", + "id": "33c1f3b7-5afc-45cb-8b43-9682ec87156d", + "metadata": {}, + "source": [ + "#### Create List of differences for saleprice and saleprice_scaled Dataframes" + ] + }, + { + "cell_type": "code", + "execution_count": 410, + "id": "d2faf410-f83e-4ccb-89d4-e6f8c7adffbb", + "metadata": {}, + "outputs": [], + "source": [ + "# create list of difference bwtween imputed and orginal value\n", + "\n", + "sp_mean_diff_1 = []\n", + "sp_mean_diff_5 = []\n", + "sp_mean_diff_10 = []\n", + "count = 0\n", + "\n", + "for i in sp_1_idx:\n", + " count +=1\n", + " diff1 = abs(df_saleprice2['sp_copy_1_percent'][i] - df_saleprice2['SalePrice'][i])\n", + " sp_mean_diff_1.append(diff1)\n", + " \n", + "\n", + "for i in sp_5_idx:\n", + " diff5 = abs(df_saleprice2['sp_copy_5_percent'][i] - df_saleprice2['SalePrice'][i])\n", + " sp_mean_diff_5.append(diff5)\n", + "\n", + "for i in sp_10_idx:\n", + " diff10 = abs(df_saleprice2['sp_copy_10_percent'][i] - df_saleprice2['SalePrice'][i])\n", + " sp_mean_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 411, + "id": "789b07c5-530a-4111-8c97-f5297f7da5e4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_mean_diff_1))\n", + "print(len(sp_mean_diff_5))\n", + "print(len(sp_mean_diff_10))" + ] + }, + { + "cell_type": "code", + "execution_count": 412, + "id": "4fec222c-2420-41af-9e2a-d9773e1d6259", + 
"metadata": {}, + "outputs": [], + "source": [ + "# create lists of absolute differences between imputed and original values\n", + "\n", + "sp_scaled_mean_diff_1 = []\n", + "sp_scaled_mean_diff_5 = []\n", + "sp_scaled_mean_diff_10 = []\n", + "count = 0\n", + "\n", + "for i in sp_1_idx:\n", + " count +=1\n", + " diff1 = abs(df_saleprice_scaled2['sp_copy_1_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", + " sp_scaled_mean_diff_1.append(diff1)\n", + " \n", + "\n", + "for i in sp_5_idx:\n", + " diff5 = abs(df_saleprice_scaled2['sp_copy_5_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", + " sp_scaled_mean_diff_5.append(diff5)\n", + "\n", + "for i in sp_10_idx:\n", + " diff10 = abs(df_saleprice_scaled2['sp_copy_10_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", + " sp_scaled_mean_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 413, + "id": "de9bf1de-68fe-4894-915a-7069b386123f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_scaled_mean_diff_1))\n", + "print(len(sp_scaled_mean_diff_5))\n", + "print(len(sp_scaled_mean_diff_10))" + ] + }, + { + "cell_type": "markdown", + "id": "f7b93757-d1a7-41a1-85fa-3ee77734be5b", + "metadata": {}, + "source": [ + "#### Calculate mean and var of list of diff. 
- MEAN impute SalePrice" + ] + }, + { + "cell_type": "code", + "execution_count": 414, + "id": "c60d3aad-33f0-48f4-8bb0-f8af45e33e1e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 55971.63676767676 and variance 1% is 1103367192.190047\n", + "The mean of 5% is 58478.24210526314 and variance 5% is 3139731297.2794733\n", + "The mean of 10% is 61028.709911 and variance 10% is 3846674638.263318\n" + ] + } + ], + "source": [ + "m1 = sum(sp_mean_diff_1) / len(sp_mean_diff_1)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res1 = sum((xi - m1) ** 2 for xi in sp_mean_diff_1) / len(sp_mean_diff_1)\n", + "\n", + "m5 = sum(sp_mean_diff_5) / len(sp_mean_diff_5)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res5 = sum((xii - m5) ** 2 for xii in sp_mean_diff_5) / len(sp_mean_diff_5)\n", + "\n", + "\n", + "m10 = sum(sp_mean_diff_10) / len(sp_mean_diff_10)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_mean_diff_10) / len(sp_mean_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 415, + "id": "e7f6e5cf-4eaa-4bfe-add2-fc7f600941b7", + "metadata": {}, + "outputs": [], + "source": [ + "df_mean_saleprice = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", + " '5%_saleprice': [m5, var_res5],\n", + " '10%_saleprice': [m10, var_res10]}, orient='index')\n", + "df_mean_saleprice.columns=['diff. list Mean(MI)', 'diff. list Var.(MI)']" + ] + }, + { + "cell_type": "code", + "execution_count": 416, + "id": "cc37eeaf-e3cd-4a83-870d-fab7037eeffe", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
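<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n",
+ " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>diff. list Mean(MI)</th>\n", + " <th>diff. list Var.(MI)</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n",
+ " <tr>\n", + " <th>1%_saleprice</th>\n", + " <td>55971.636768</td>\n", + " <td>1.103367e+09</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>5%_saleprice</th>\n", + " <td>58478.242105</td>\n", + " <td>3.139731e+09</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>10%_saleprice</th>\n", + " <td>61028.709911</td>\n", + " <td>3.846675e+09</td>\n", + " </tr>\n",
+ " </tbody>\n", + "</table>\n", + "</div>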
" + ], + "text/plain": [ + " diff. list Mean(MI) diff. list Var.(MI)\n", + "1%_saleprice 55971.636768 1.103367e+09\n", + "5%_saleprice 58478.242105 3.139731e+09\n", + "10%_saleprice 61028.709911 3.846675e+09" + ] + }, + "execution_count": 416, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_mean_saleprice" + ] + }, + { + "cell_type": "markdown", + "id": "f405f073-1b45-47e8-873b-7a9d34ad0e5c", + "metadata": {}, + "source": [ + "#### Calculate mean and var of list of diff. - MEAN impute SalePrice scaled" + ] + }, + { + "cell_type": "code", + "execution_count": 417, + "id": "2516b4f7-6b79-4636-9bd5-0738343ea355", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 0.0 and varience 1% is 0.0\n", + "The mean of 5% is 0.00893610697344667 and varience 5% is 0.0014044730755095036\n", + "The mean of 10% is 0.00893610697344667 and varience 10% is 0.0004431848362889144\n" + ] + } + ], + "source": [ + "m1 = sum(sp_scaled_mean_diff_1) / len(sp_scaled_mean_diff_1)\n", + "\n", + "# calculate variance using a list comprehension\n", + "var_res1 = sum((xi - m1) ** 2 for xi in sp_scaled_mean_diff_1) / len(sp_scaled_mean_diff_1)\n", + "\n", + "m5 = sum(sp_scaled_mean_diff_5) / len(sp_scaled_mean_diff_5)\n", + "\n", + "# calculate variance using a list comprehension\n", + "var_res5 = sum((xii - m5) ** 2 for xii in sp_scaled_mean_diff_5) / len(sp_scaled_mean_diff_5)\n", + "\n", + "\n", + "m10 = sum(sp_scaled_mean_diff_10) / len(sp_scaled_mean_diff_10)\n", + "\n", + "# calculate variance using a list comprehension\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_scaled_mean_diff_10) / len(sp_scaled_mean_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and varience 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and varience 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m5} and varience 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + 
"execution_count": 418, + "id": "fe6a93b8-d6cb-4d7d-856b-ab4ee8fe78fc", + "metadata": {}, + "outputs": [], + "source": [ + "df_mean_saleprice_scaled = pd.DataFrame.from_dict({'1%_saleprice_scaled': [m1, var_res1],\n", + " '5%_saleprice_scaled': [m5, var_res5],\n", + " '10%_saleprice_scaled': [m10, var_res10]}, orient='index')\n", + "df_mean_saleprice_scaled.columns=['diff. list Mean(MI) scaled', 'diff. list Var.(MI) scaled']" + ] + }, + { + "cell_type": "code", + "execution_count": 419, + "id": "e74c35ed-7c2d-44ab-b6c2-4d81c2c6b6bb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
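<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n",
+ " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>diff. list Mean(MI) scaled</th>\n", + " <th>diff. list Var.(MI) scaled</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n",
+ " <tr>\n", + " <th>1%_saleprice_scaled</th>\n", + " <td>0.000000</td>\n", + " <td>0.000000</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>5%_saleprice_scaled</th>\n", + " <td>0.008936</td>\n", + " <td>0.001404</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>10%_saleprice_scaled</th>\n", + " <td>0.007492</td>\n", + " <td>0.000443</td>\n", + " </tr>\n",
+ " </tbody>\n", + "</table>\n", + "</div>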
" + ], + "text/plain": [ + " diff. list Mean(MI) scaled diff. list Var.(MI) scaled\n", + "1%_saleprice_scaled 0.000000 0.000000\n", + "5%_saleprice_scaled 0.008936 0.001404\n", + "10%_saleprice_scaled 0.007492 0.000443" + ] + }, + "execution_count": 419, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_mean_saleprice_scaled" + ] + }, + { + "cell_type": "markdown", + "id": "876b979a-f5c4-43a7-9ead-d5d866bef078", + "metadata": {}, + "source": [ + "# 2.2 Housing Data Results - KNN and MEAN" + ] + }, + { + "cell_type": "code", + "execution_count": 420, + "id": "fea4b521-03a3-46ce-b217-27225eb868af", + "metadata": {}, + "outputs": [], + "source": [ + "results1 = pd.concat([df_knn_saleprice, df_knn_saleprice_scaled, df_mean_saleprice, df_mean_saleprice_scaled])" + ] + }, + { + "cell_type": "code", + "execution_count": 421, + "id": "631729d6-e853-4ba5-b5fd-4e632ec00d5f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
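<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n",
+ " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>diff. list Mean(KNN)</th>\n", + " <th>diff. list Var.(KNN)</th>\n", + " <th>diff. list Mean(KNN) scaled</th>\n", + " <th>diff. list Var.(KNN) scaled</th>\n", + " <th>diff. list Mean(MI)</th>\n", + " <th>diff. list Var.(MI)</th>\n", + " <th>diff. list Mean(MI) scaled</th>\n", + " <th>diff. list Var.(MI) scaled</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n",
+ " <tr>\n", + " <th>1%_saleprice</th>\n", + " <td>170.000</td>\n", + " <td>4.240000e+04</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>5%_saleprice</th>\n", + " <td>444.944</td>\n", + " <td>2.554554e+06</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>10%_saleprice</th>\n", + " <td>564.784</td>\n", + " <td>6.304767e+06</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>1%_saleprice</th>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>0.000000</td>\n", + " <td>0.000000e+00</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>5%_saleprice</th>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>0.000026</td>\n", + " <td>2.134350e-08</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>10%_saleprice</th>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>0.000032</td>\n", + " <td>1.417383e-08</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>1%_saleprice</th>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>55971.636768</td>\n", + " <td>1.103367e+09</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>5%_saleprice</th>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>58478.242105</td>\n", + " <td>3.139731e+09</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>10%_saleprice</th>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>61028.709911</td>\n", + " <td>3.846675e+09</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>1%_saleprice_scaled</th>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>0.000000</td>\n", + " <td>0.000000</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>5%_saleprice_scaled</th>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>0.008936</td>\n", + " <td>0.001404</td>\n", + " </tr>\n",
+ " <tr>\n", + " <th>10%_saleprice_scaled</th>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>0.007492</td>\n", + " <td>0.000443</td>\n", + " </tr>\n",
+ " </tbody>\n", + "</table>\n", + "</div>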
" + ], + "text/plain": [ + " diff. list Mean(KNN) diff. list Var.(KNN) \\\n", + "1%_saleprice 170.000 4.240000e+04 \n", + "5%_saleprice 444.944 2.554554e+06 \n", + "10%_saleprice 564.784 6.304767e+06 \n", + "1%_saleprice NaN NaN \n", + "5%_saleprice NaN NaN \n", + "10%_saleprice NaN NaN \n", + "1%_saleprice NaN NaN \n", + "5%_saleprice NaN NaN \n", + "10%_saleprice NaN NaN \n", + "1%_saleprice_scaled NaN NaN \n", + "5%_saleprice_scaled NaN NaN \n", + "10%_saleprice_scaled NaN NaN \n", + "\n", + " diff. list Mean(KNN) scaled \\\n", + "1%_saleprice NaN \n", + "5%_saleprice NaN \n", + "10%_saleprice NaN \n", + "1%_saleprice 0.000000 \n", + "5%_saleprice 0.000026 \n", + "10%_saleprice 0.000032 \n", + "1%_saleprice NaN \n", + "5%_saleprice NaN \n", + "10%_saleprice NaN \n", + "1%_saleprice_scaled NaN \n", + "5%_saleprice_scaled NaN \n", + "10%_saleprice_scaled NaN \n", + "\n", + " diff. list Var.(KNN) scaled diff. list Mean(MI) \\\n", + "1%_saleprice NaN NaN \n", + "5%_saleprice NaN NaN \n", + "10%_saleprice NaN NaN \n", + "1%_saleprice 0.000000e+00 NaN \n", + "5%_saleprice 2.134350e-08 NaN \n", + "10%_saleprice 1.417383e-08 NaN \n", + "1%_saleprice NaN 55971.636768 \n", + "5%_saleprice NaN 58478.242105 \n", + "10%_saleprice NaN 61028.709911 \n", + "1%_saleprice_scaled NaN NaN \n", + "5%_saleprice_scaled NaN NaN \n", + "10%_saleprice_scaled NaN NaN \n", + "\n", + " diff. list Var.(MI) diff. list Mean(MI) scaled \\\n", + "1%_saleprice NaN NaN \n", + "5%_saleprice NaN NaN \n", + "10%_saleprice NaN NaN \n", + "1%_saleprice NaN NaN \n", + "5%_saleprice NaN NaN \n", + "10%_saleprice NaN NaN \n", + "1%_saleprice 1.103367e+09 NaN \n", + "5%_saleprice 3.139731e+09 NaN \n", + "10%_saleprice 3.846675e+09 NaN \n", + "1%_saleprice_scaled NaN 0.000000 \n", + "5%_saleprice_scaled NaN 0.008936 \n", + "10%_saleprice_scaled NaN 0.007492 \n", + "\n", + " diff. 
list Var.(MI) scaled \n", + "1%_saleprice NaN \n", + "5%_saleprice NaN \n", + "10%_saleprice NaN \n", + "1%_saleprice NaN \n", + "5%_saleprice NaN \n", + "10%_saleprice NaN \n", + "1%_saleprice NaN \n", + "5%_saleprice NaN \n", + "10%_saleprice NaN \n", + "1%_saleprice_scaled 0.000000 \n", + "5%_saleprice_scaled 0.001404 \n", + "10%_saleprice_scaled 0.000443 " + ] + }, + "execution_count": 421, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "results1" + ] + }, + { + "cell_type": "code", + "execution_count": 422, + "id": "a255c5bc-c062-4029-8f18-0c7644ca1d7c", + "metadata": {}, + "outputs": [], + "source": [ + "results1.to_csv('housing_data_saleprice_KNN_Mean_results.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c9b0060e-129c-465e-a2a5-c3113ac4b936", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "pytorch_kz_env", + "language": "python", + "name": "pytorch_kz_env" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/random_numbers_1000.csv b/notebooks/random_numbers_1000.csv new file mode 100644 index 0000000..b988bad --- /dev/null +++ b/notebooks/random_numbers_1000.csv @@ -0,0 +1,1001 @@ +,number +0,0.14461602473455892 +1,0.07751503129173953 +2,0.15593297226701996 +3,0.09720879582042008 +4,0.32375017402684214 +5,0.686823745565341 +6,0.7068035159437503 +7,0.9167216890721541 +8,0.6352048775376901 +9,0.17132904054220055 +10,0.8159661332230377 +11,0.16475992352396795 +12,0.0409370627667629 +13,0.16726651783050572 +14,0.9709841404608549 +15,0.7314646963631376 +16,0.3426860074270154 +17,0.03452867763070577 +18,0.3574832521777054 +19,0.5745017180628896 +20,0.9464018964648249 
+21,0.17346442317598176 +22,0.7981877585797893 +23,0.7809787573425518 +24,0.5238193208352585 +25,0.7821735568253659 +26,0.9934482007890996 +27,0.4184423331593896 +28,0.2599014381523176 +29,0.79832254805514 +30,0.6041862665264831 +31,0.3819864440431342 +32,0.8521701748665009 +33,0.3126469510739037 +34,0.573165703657289 +35,0.6265563684951247 +36,0.739416657331853 +37,0.012060677103418738 +38,0.9526287180476393 +39,0.3919187115227588 +40,0.2638910529614693 +41,0.28055121530104343 +42,0.5573435702875359 +43,0.810470016341365 +44,0.5595615325523974 +45,0.408760756112558 +46,0.8630495060594643 +47,0.8614542990838314 +48,0.8236790421079785 +49,0.445107982060686 +50,0.9240480240430241 +51,0.17212099430841699 +52,0.2821871607285322 +53,0.37501938886942654 +54,0.4401439635045862 +55,0.1316322082815632 +56,0.06144638522796442 +57,0.9719025725097523 +58,0.6437628611013991 +59,0.18965508288943556 +60,0.06647339880458658 +61,0.9432875072199843 +62,0.9635593500723799 +63,0.8159138106628153 +64,0.5268141359426226 +65,0.8097577290919002 +66,0.10832871122562193 +67,0.513926863373751 +68,0.5574679474011387 +69,0.23117155673924017 +70,0.7988683863124257 +71,0.14232155967666804 +72,0.4114506075996932 +73,0.028703811806714996 +74,0.15511224785648736 +75,0.5179635133770123 +76,0.6343922699321491 +77,0.5442703351502044 +78,0.2051777299642784 +79,0.9514959457303863 +80,0.8616963431169906 +81,0.9260797192939593 +82,0.6837050092238902 +83,0.6341651538285088 +84,0.47009701258761005 +85,0.6290009641921982 +86,0.9976095248457479 +87,0.6766165875739423 +88,0.34775785853790764 +89,0.24721164403263118 +90,0.7644613432516099 +91,0.8578411267105046 +92,0.02847593788616165 +93,0.7352417508308864 +94,0.6439666934556955 +95,0.4145386388213331 +96,0.9000774058908544 +97,0.20984159212668807 +98,0.5736834527493817 +99,0.5731814122745401 +100,0.39175113064248857 +101,0.9414042225202869 +102,0.35865018640717594 +103,0.34942147114579614 +104,0.6287322577319368 +105,0.5640558939154473 +106,0.9935619072485498 
+107,0.3230972874260011 +108,0.30050448033239197 +109,0.8535359869169682 +110,0.8186071691655027 +111,0.8507126794809163 +112,0.11848293702439716 +113,0.34039997170201786 +114,0.24848934681272938 +115,0.8713564278618446 +116,0.7192981378269337 +117,0.5612771185476495 +118,0.3001718489057721 +119,0.5582566234063182 +120,0.20715922789136187 +121,0.24718349962906172 +122,0.9096809353144786 +123,0.9496126251594162 +124,0.19298962232482253 +125,0.6823143045816399 +126,0.2950869303839806 +127,0.700872866143569 +128,0.9246255564110638 +129,0.3918411220739513 +130,0.5046695500081352 +131,0.40242035593564884 +132,0.5348070625842399 +133,0.6190144238291141 +134,0.6527067332418969 +135,0.7798534811708006 +136,0.8371435153002993 +137,0.7256654504898371 +138,0.19486710733751433 +139,0.17061227388763445 +140,0.3866266766943538 +141,0.9861342050546121 +142,0.12499832976125236 +143,0.4076100289319884 +144,0.24405060656519562 +145,0.24658924623282708 +146,0.31303910086742404 +147,0.13582549628998997 +148,0.4267352707490074 +149,0.6860815270131422 +150,0.2632104445655937 +151,0.7095899448677616 +152,0.30697391312148903 +153,0.15020764760355143 +154,0.33008237434926957 +155,0.24730791798017127 +156,0.7732146302465086 +157,0.3986960975344779 +158,0.878302550945857 +159,0.3073561016445441 +160,0.21123045619113257 +161,0.5806664509148879 +162,0.8984369263318096 +163,0.8363942698985983 +164,0.2812623945509036 +165,0.10724622968453401 +166,0.5703943012638906 +167,0.7309007201275504 +168,0.6865969394598082 +169,0.17355862259884247 +170,0.41747139600619776 +171,0.8046329781439144 +172,0.29734663284924356 +173,0.6874907011989809 +174,0.27926268019676004 +175,0.16857167772740067 +176,0.808320103826969 +177,0.22397888146185907 +178,0.4961137292567884 +179,0.39791460648438426 +180,0.749624236829485 +181,0.8166672255804612 +182,0.5416591595071085 +183,0.7784968348980786 +184,0.5246274130247313 +185,0.6165788811775392 +186,0.18993747860389354 +187,0.4375903866391334 +188,0.8977799452863308 
+189,0.8974808404906014 +190,0.7833353163003136 +191,0.5735505446147654 +192,0.8592478266591742 +193,0.555628191461239 +194,0.29218190018690193 +195,0.6823254024415241 +196,0.7253556992028032 +197,0.6348979373592366 +198,0.738955355288769 +199,0.40548956817360793 +200,0.9965074549246696 +201,0.6680475408833246 +202,0.4753087000915296 +203,0.8154531729554498 +204,0.39674071637462927 +205,0.3465424212251109 +206,0.3010873336265142 +207,0.3453059844140016 +208,0.3376649450698975 +209,0.4520568726021712 +210,0.7102711170123417 +211,0.5676304992868505 +212,0.246451823758292 +213,0.3045971494873321 +214,0.9191799326806603 +215,0.09062317707388845 +216,0.6456030768852257 +217,0.8145182625891805 +218,0.3502989381872097 +219,0.5454669053640021 +220,0.9229510982790893 +221,0.5017605011244138 +222,0.5814298938642755 +223,0.212077064497179 +224,0.9084673048697015 +225,0.8420689009087419 +226,0.09544595716628035 +227,0.5428219386076877 +228,0.334040059452826 +229,0.5883742904617911 +230,0.6681527250828868 +231,0.920066967991107 +232,0.6980014815164323 +233,0.5140583511099508 +234,0.574062901794968 +235,0.8671650796521554 +236,0.29309281744572635 +237,0.6255644089859125 +238,0.41377688075614283 +239,0.6541722779053092 +240,0.7022455597573617 +241,0.7027961835253476 +242,0.32866027307469425 +243,0.9438823677034145 +244,0.6392304917718383 +245,0.35610068008813955 +246,0.5109988272940061 +247,0.7549785046509206 +248,0.911498498846909 +249,0.7269132750864981 +250,0.43346849143235944 +251,0.9613052659398792 +252,0.06410207161162618 +253,0.7224542800953787 +254,0.8605028822342475 +255,0.9379303538857604 +256,0.11890111097053702 +257,0.06560232272410749 +258,0.9815175258058294 +259,0.5816233574934034 +260,0.3223771211316614 +261,0.010794999021216611 +262,0.48232848210912416 +263,0.6888091652734284 +264,0.7510123953710294 +265,0.3931342633771988 +266,0.4285185942589612 +267,0.028804295777431044 +268,0.7471054611787746 +269,0.5188475627728396 +270,0.3699806335289325 
+271,0.6733240981418717 +272,0.455659972278607 +273,0.8865920570538507 +274,0.9773310825483524 +275,0.9114683092627319 +276,0.7234740743957591 +277,0.47378640650570536 +278,0.9044322182580692 +279,0.6490971485609244 +280,0.9325706015784121 +281,0.15806103989245135 +282,0.20431604755502109 +283,0.9516960107212825 +284,0.17933034496530176 +285,0.10632943259433447 +286,0.20529052976827733 +287,0.26644977396966907 +288,0.990842732357776 +289,0.6626056375310618 +290,0.8934023242009224 +291,0.6087787761836707 +292,0.6622123753279109 +293,0.2795715500728444 +294,0.7356211918792761 +295,0.023450952083761578 +296,0.29930766895885463 +297,0.9605253146799532 +298,0.4773205356946918 +299,0.896685482640458 +300,0.20788119046629716 +301,0.21907107928738412 +302,0.3417751133430835 +303,0.8785812995819484 +304,0.7629857606713326 +305,0.10409839946928867 +306,0.5375122454578438 +307,0.12610808266796247 +308,0.9207106566062669 +309,0.6614470367535862 +310,0.6646296886200127 +311,0.02517423927343887 +312,0.5355435671395777 +313,0.9639505712726043 +314,0.8427700240424094 +315,0.5173256280251634 +316,0.6809361625916177 +317,0.25269387981635383 +318,7.39014254360626e-05 +319,0.6832379417409375 +320,0.3814705574477538 +321,0.2953366513034189 +322,0.8601629667491553 +323,0.4116625534183441 +324,0.20248827761656263 +325,0.0950677170887495 +326,0.37432668808858527 +327,0.5002586204770462 +328,0.5903766299860601 +329,0.4069147751233232 +330,0.46587616114566655 +331,0.20767274566478722 +332,0.4405095567714371 +333,0.7561490702983013 +334,0.9691510044256642 +335,0.9835349892112961 +336,0.08167974686852508 +337,0.011831197129136273 +338,0.2533369151703784 +339,0.7258386397040382 +340,0.1533224004672512 +341,0.16976063838308353 +342,0.3535761067133554 +343,0.9558080514913609 +344,0.34787269425215606 +345,0.6384858181781367 +346,0.19142808499268715 +347,0.3723886499126876 +348,0.4610104267479409 +349,0.7386414627232165 +350,0.5547224736511918 +351,0.07560627992824742 +352,0.38543929036328295 
+353,0.023870001618478964 +354,0.08490118558975879 +355,0.9523181200843006 +356,0.835121255953561 +357,0.8313253101018512 +358,0.4477164423027221 +359,0.427173224834863 +360,0.2607502696316568 +361,0.6518880149684392 +362,0.989596091701078 +363,0.4737188317675711 +364,0.951663574431818 +365,0.6389835611029937 +366,0.4255250760028354 +367,0.36494823219306194 +368,0.10394871793754767 +369,0.08787887115953141 +370,0.05185866702404662 +371,0.5729228447658512 +372,0.3557153056497062 +373,0.14169200930635462 +374,0.6026259214704931 +375,0.6780938325392907 +376,0.0019220493053816456 +377,0.14423401505903843 +378,0.31021740847078616 +379,0.26542859991807166 +380,0.05293698137098246 +381,0.5447383348415423 +382,0.19410883367100906 +383,0.2759766462115508 +384,0.6085305795585376 +385,0.19018564330800136 +386,0.6001023952936514 +387,0.5500869240450543 +388,0.308558554189692 +389,0.613015054522192 +390,0.5053671279653127 +391,0.8033565610860482 +392,0.3190316438196028 +393,0.8430688477494918 +394,0.3907441626865247 +395,0.3749010705929905 +396,0.20374147066354986 +397,0.4445572005828903 +398,0.4325615226381033 +399,0.747347832034453 +400,0.1408237945119577 +401,0.5629196065967164 +402,0.8883715667513505 +403,0.7262344816634011 +404,0.1015240156369166 +405,0.6274596622730756 +406,0.6724938834493908 +407,0.45890555605876826 +408,0.253862163313197 +409,0.20213399227024142 +410,0.9431472444002996 +411,0.4412716272261822 +412,0.6778537756613036 +413,0.5609208700560778 +414,0.7852790417028147 +415,0.8301487622409094 +416,0.0695242591856422 +417,0.5342345164968271 +418,0.020198821857018268 +419,0.11932836566667071 +420,0.7351542137502673 +421,0.879354084852934 +422,0.060390921051916124 +423,0.3517659280158124 +424,0.25831407832342757 +425,0.25041309629182773 +426,0.6324032934179679 +427,0.6905116746744266 +428,0.038781141504878325 +429,0.11872222658971077 +430,0.3402172182577837 +431,0.1117834948318035 +432,0.8974663997148172 +433,0.7721061886641211 +434,0.467763325594456 
+435,0.45960484726135 +436,0.11940893902740168 +437,0.8892320824757846 +438,0.056170722740824464 +439,0.8348974660229447 +440,0.8328276290445746 +441,0.015421942378315512 +442,0.6078039146470725 +443,0.9797170916017848 +444,0.817871594488278 +445,0.4281570072853328 +446,0.9826586617461194 +447,0.5714323337805088 +448,0.5655480118995616 +449,0.13163751508874266 +450,0.5727166298844355 +451,0.3876989055629705 +452,0.24625748760449773 +453,0.062376725489559304 +454,0.1868295868142189 +455,0.07519337399332371 +456,0.8615125038568271 +457,0.0430765434686432 +458,0.7784279481001283 +459,0.1559200654309939 +460,0.28457480300272475 +461,0.4833371043049315 +462,0.21688560355701902 +463,0.051055375260327884 +464,0.8764119752087609 +465,0.03830180552041673 +466,0.899276170682331 +467,0.5326669068942715 +468,0.7966592760107886 +469,0.5977938689767619 +470,0.35735055753216216 +471,0.7502306585594846 +472,0.27262195939610845 +473,0.3367003915054816 +474,0.3718378858875636 +475,0.7252726856566986 +476,0.6108078470654391 +477,0.160140124957443 +478,0.640641195165919 +479,0.819043970313203 +480,0.9460930077740923 +481,0.3955113176387407 +482,0.08228064172201954 +483,0.5692148152461914 +484,0.9379027430417781 +485,0.7262721958954546 +486,0.9974714724600596 +487,0.9816411645054782 +488,0.02801478549452141 +489,0.35876394018958924 +490,0.46224300725504386 +491,0.07977812492324099 +492,0.7825821331768681 +493,0.7728747320072956 +494,0.18411522733742114 +495,0.9349933626453013 +496,0.3305156463539396 +497,0.05247324921620988 +498,0.3784435570491954 +499,0.8296025413407634 +500,0.44108727645927825 +501,0.2993358032378495 +502,0.8631126359025391 +503,0.250262827945147 +504,0.09566738091105942 +505,0.7130474946994906 +506,0.2235781443128807 +507,0.7026149405611689 +508,0.7224945548679957 +509,0.6170012611217315 +510,0.20186432914831431 +511,0.7852714452298651 +512,0.8903242744728199 +513,0.1399056906045737 +514,0.17026945833848617 +515,0.514586763470415 +516,0.9736100357614889 
+517,0.7746591507784915 +518,0.29437001890274195 +519,0.8027253084378705 +520,0.08386991518130038 +521,0.09136100092018629 +522,0.8983567502463687 +523,0.8868693311046169 +524,0.533466309836137 +525,0.42900189716927073 +526,0.1821870276409372 +527,0.4315150943786541 +528,0.47383956070476785 +529,0.42647315825719867 +530,0.20889106515275513 +531,0.15615589390655582 +532,0.7683598815481214 +533,0.8407774935346721 +534,0.4599058924434972 +535,0.20858605861422153 +536,0.25419023941340724 +537,0.03537597137641857 +538,0.5037011171417803 +539,0.319855948227728 +540,0.6143932185624659 +541,0.11338109816795006 +542,0.6071773224023549 +543,0.6320103598568474 +544,0.17739418618305125 +545,0.9193076779462215 +546,0.539317629461803 +547,0.361121293498606 +548,0.8225521587592494 +549,0.037067189096233966 +550,0.7644376889628157 +551,0.9614375433647248 +552,0.26247829558958613 +553,0.04497704041286332 +554,0.49347237237561237 +555,0.10135820428850206 +556,0.9054759324635467 +557,0.3912479745377101 +558,0.16984308812935767 +559,0.3130327921420567 +560,0.2845393861009978 +561,0.7216547111114262 +562,0.6129838442158642 +563,0.6128072542663652 +564,0.5153838338789999 +565,0.7131085367862817 +566,0.8713477772442941 +567,0.9419360672901563 +568,0.9061770339937525 +569,0.9973713503589123 +570,0.6511737928834931 +571,0.0980714039543844 +572,0.12371358453480508 +573,0.5817580949438432 +574,0.3878197750090975 +575,0.3836838844640248 +576,0.3330772932400339 +577,0.8937920239990277 +578,0.42660379831271933 +579,0.09749777821209016 +580,0.03273234283716975 +581,0.5822939987582022 +582,0.2818759219290342 +583,0.9973773382690185 +584,0.3485811650096795 +585,0.38385951065171464 +586,0.14314846321555819 +587,0.41168484188278187 +588,0.5560325831949468 +589,0.6786651527115524 +590,0.27941662328630534 +591,0.12758615070559087 +592,0.8706880276786881 +593,0.42247163006009736 +594,0.8747921784321767 +595,0.9819789489386005 +596,0.53212913612486 +597,0.6820548577830702 +598,0.14172556124342628 
+599,0.8954903213991394 +600,0.8877895505948118 +601,0.2899734461911796 +602,0.39888758518426926 +603,0.5085270928974726 +604,0.5397323464650328 +605,0.5355595876880633 +606,0.6680045600991499 +607,0.07890855054344348 +608,0.36522753036507116 +609,0.7525828516063231 +610,0.8155334605307646 +611,0.948872329161571 +612,0.10085424156574552 +613,0.3063104444859259 +614,0.012248867459916157 +615,0.8332405266792986 +616,0.4477328006875678 +617,0.7381760858313725 +618,0.5381307278002123 +619,0.64442652761133 +620,0.407653279216153 +621,0.988120343671508 +622,0.349242158981631 +623,0.11439639275168989 +624,0.773600974105568 +625,0.3422508667504136 +626,0.35092901992304426 +627,0.6998555631853256 +628,0.5351463864628954 +629,0.6941915466139217 +630,0.27550090759498 +631,0.03955870654832727 +632,0.9737612333749457 +633,0.85659566451438 +634,0.318016024519294 +635,0.07264967870375483 +636,0.6266672136646679 +637,0.5427530067840908 +638,0.08013357115177333 +639,0.27865447324993387 +640,0.8204327600278204 +641,0.6472338718548233 +642,0.8981066937808309 +643,0.9904134149156683 +644,0.7570648348954108 +645,0.04820939759809295 +646,0.49659488586991385 +647,0.2681871451946377 +648,0.05376519761698151 +649,0.1536101940376925 +650,0.2458849441738461 +651,0.19991898782481343 +652,0.49815295225863154 +653,0.7475145062482099 +654,0.5814474904248211 +655,0.9103815228294841 +656,0.8091439841662771 +657,0.044556478634595 +658,0.06582839484468272 +659,0.8723124347377673 +660,0.761407419742959 +661,0.6295611439582762 +662,0.5602756647971817 +663,0.028833108636930782 +664,0.6925154173449602 +665,0.30781547100300766 +666,0.9456746547718861 +667,0.7733519530494579 +668,0.07325928323474962 +669,0.06051359621130603 +670,0.7684091239449635 +671,0.0772898478864189 +672,0.4652145959688888 +673,0.4373876627767307 +674,0.6267684478070814 +675,0.7183418633741062 +676,0.28256468766217413 +677,0.5073826011665699 +678,0.31820311938601464 +679,0.4089168748142118 +680,0.29885921770184043 
+681,0.03372851278925548 +682,0.6703170306185748 +683,0.33198869826189814 +684,0.5975405123566822 +685,0.8211657963714585 +686,0.3461079054656666 +687,0.48616250243415104 +688,0.13447950866733605 +689,0.562667191415577 +690,0.7678216928305848 +691,0.4530052286033409 +692,0.5010228200975811 +693,0.4323309760765164 +694,0.36743023729184987 +695,0.1723991626473217 +696,0.4337302869241262 +697,0.24966845326719822 +698,0.642167289966723 +699,0.616830008851879 +700,0.7703637450499222 +701,0.21386173939654995 +702,0.704115745850898 +703,0.6905967742396926 +704,0.14550064889741277 +705,0.6045853103312959 +706,0.03670533871021342 +707,0.7158949195594291 +708,0.5963326610400751 +709,0.7656919572130952 +710,0.16593604258736716 +711,0.37116447793513807 +712,0.8005826062394383 +713,0.041771054650389106 +714,0.6847846478124059 +715,0.4993883882765534 +716,0.1850707225574446 +717,0.5630874044249621 +718,0.37025234599378876 +719,0.7107125656980158 +720,0.4118677519270143 +721,0.7742568360649871 +722,0.8100159822588088 +723,0.3174629757017041 +724,0.5303493054894146 +725,0.8849961235045513 +726,0.3273403729546115 +727,0.6172150375830504 +728,0.15983060531231819 +729,0.4728594510763161 +730,0.4529506215548965 +731,0.5035430872599636 +732,0.004927231548344402 +733,0.1940383807540148 +734,0.14982458424309364 +735,0.8563549025851751 +736,0.03884058951015723 +737,0.28522238435867453 +738,0.8057900651211597 +739,0.03021709036511122 +740,0.07224489509195386 +741,0.056610587902518716 +742,0.9264467821014194 +743,0.8138662549320123 +744,0.41783822642927937 +745,0.8723047253359363 +746,0.18136207963463802 +747,0.7164025688996778 +748,0.8196872616954788 +749,0.8068822585021751 +750,0.007129291396152926 +751,0.2602504030386925 +752,0.46370562857123043 +753,0.163784347412389 +754,0.23315134483036648 +755,0.6177440123966893 +756,0.2561521510607473 +757,0.562548076892661 +758,0.5051861935336659 +759,0.13892890236963107 +760,0.004539613445676105 +761,0.17372524036846493 +762,0.6832015932759417 
+763,0.8325857535808265 +764,6.826981312790803e-05 +765,0.19612584863473537 +766,0.4145509719106246 +767,0.2619625834737831 +768,0.24549665294458467 +769,0.27612714237335956 +770,0.8531795517703349 +771,0.047146001044882424 +772,0.562788499298586 +773,0.43099863376962144 +774,0.26050958743406505 +775,0.7788002061420074 +776,0.6743332176478016 +777,0.40066992822420555 +778,0.9760876856806906 +779,0.539119034171984 +780,0.18208901259127885 +781,0.12376735142175199 +782,0.9551514655114575 +783,0.7810294736400567 +784,0.9212583468427701 +785,0.8010043139785669 +786,0.22944051406680832 +787,0.050052241727377766 +788,0.6786745563768194 +789,0.429793629888368 +790,0.42563361699182967 +791,0.6784838537337905 +792,0.2858761720399675 +793,0.2890895011305119 +794,0.025121632825633844 +795,0.25765509253553054 +796,0.43572322499776717 +797,0.6647102169428171 +798,0.10847616026636064 +799,0.2537450603718995 +800,0.24416864473064126 +801,0.0672514263787497 +802,0.16935229953659314 +803,0.27439580112524253 +804,0.4284736191801598 +805,0.8586734606964571 +806,0.4315781202007021 +807,0.09915635234890208 +808,0.44899905032025744 +809,0.013316716483281699 +810,0.8391449274551819 +811,0.5061770521104294 +812,0.0672045714638001 +813,0.2933544809181752 +814,0.18022127393582965 +815,0.8781136361676581 +816,0.5157135259800142 +817,0.46243072336418334 +818,0.6222491687600095 +819,0.8889053056935484 +820,0.04571095891205823 +821,0.1513640763692672 +822,0.7774449453314359 +823,0.5183880690457242 +824,0.2921720252636122 +825,0.09168278609192515 +826,0.39002371887786735 +827,0.3580585061283823 +828,0.12047021435718164 +829,0.6738337221623005 +830,0.21958552211366156 +831,0.5648142473736366 +832,0.23497653874753555 +833,0.16544595712611387 +834,0.040561694693181605 +835,0.7355715205459343 +836,0.9004365787736869 +837,0.5459151013055901 +838,0.7480058346265005 +839,0.7141260383574005 +840,0.1158157631511092 +841,0.9125379342891712 +842,0.3680018768100638 +843,0.7402206231811581 
+844,0.2972738079840226 +845,0.8923504613507662 +846,0.5063568640229354 +847,0.24619949696371157 +848,0.5399981903000146 +849,0.7188539530946122 +850,0.648195890336554 +851,0.724518894463568 +852,0.14288147919479144 +853,0.7994514226699949 +854,0.6226355760247099 +855,0.010176035425188967 +856,0.4131692686695717 +857,0.834692399566853 +858,0.49912957372925004 +859,0.00438814293685974 +860,0.3252041908817417 +861,0.534840233118543 +862,0.3587118743837924 +863,0.9677560902733098 +864,0.5973183201684436 +865,0.296691425381007 +866,0.5855079326424412 +867,0.20240300955532187 +868,0.6021550529096645 +869,0.8824421051967469 +870,0.3072946199859422 +871,0.3128979438155097 +872,0.5475105438225643 +873,0.4842448962628426 +874,0.15025538438496855 +875,0.310622456701922 +876,0.6023436011138587 +877,0.5754165898365287 +878,0.6577607923072721 +879,0.7857515237431592 +880,0.22057576301022253 +881,0.8661095076438114 +882,0.910244039608377 +883,0.578456971142587 +884,0.3787935162597653 +885,0.08939098828841929 +886,0.9232626564888574 +887,0.1712490756353049 +888,0.779216672902944 +889,0.3495372334946847 +890,0.47001887737996617 +891,0.29750226759355936 +892,0.2810128485470573 +893,0.2437794575755069 +894,0.2624381305719474 +895,0.8246608579175856 +896,0.6942956761673141 +897,0.11515579868519688 +898,0.1206162339748359 +899,0.26196220525263014 +900,0.5553026135773536 +901,0.40720637901420265 +902,0.9638145298530792 +903,0.4117628415691498 +904,0.31618951259604455 +905,0.11765701103218917 +906,0.33470652854411564 +907,0.7366235956449027 +908,0.7581529716898141 +909,0.9554767313213507 +910,0.8837680591214232 +911,0.12426303151941864 +912,0.13192594906673982 +913,0.13159583337236658 +914,0.8413301780622977 +915,0.5495370639785346 +916,0.8125566245605387 +917,0.764454058143039 +918,0.9022709587116715 +919,0.22879685531861071 +920,0.49057430203325403 +921,0.4724960647844604 +922,0.8055598260756343 +923,0.7603094118394911 +924,0.3728373302689516 +925,0.3568389711535207 
+926,0.4241494594670866 +927,0.7538918294606227 +928,0.5278021541536974 +929,0.4605573424438759 +930,0.6738635250250887 +931,0.16054005910324365 +932,0.8428762894592794 +933,0.9518468101445031 +934,0.32776599980321264 +935,0.3459454626103713 +936,0.08290510118997685 +937,0.4134429089919419 +938,0.7577633137424186 +939,0.4360752405153524 +940,0.977898855124461 +941,0.3899549115493246 +942,0.07360874043480192 +943,0.6234394805204561 +944,0.8281399000229284 +945,0.5936401403938281 +946,0.9444301233719021 +947,0.18311569423561358 +948,0.19900897833219744 +949,0.5859537329420677 +950,0.45369641243149117 +951,0.8140494291811821 +952,0.15504116789135103 +953,0.5097058344234562 +954,0.46015129255339193 +955,0.9168374769143446 +956,0.6646855362668478 +957,0.08710995188842596 +958,0.9648211892689712 +959,0.3099412950871465 +960,0.4182764603873177 +961,0.2811470272374724 +962,0.36150098707209977 +963,0.7547921114548144 +964,0.038441021458981206 +965,0.6114605284345398 +966,0.20333754648264146 +967,0.6879693726518868 +968,0.5615887399000671 +969,0.10931708773465398 +970,0.8275712918793767 +971,0.7747109160797243 +972,0.9005913428689535 +973,0.6399242580079716 +974,0.717434307883715 +975,0.0782758727785875 +976,0.05968847507483932 +977,0.9824576958211914 +978,0.02495988725135534 +979,0.2620968894854523 +980,0.010107863826380292 +981,0.2764875736254404 +982,0.18403412415931986 +983,0.1616789092290818 +984,0.3454521050417132 +985,0.433499552863608 +986,0.040911884966301715 +987,0.20484238883308725 +988,0.6675520566953549 +989,0.6160709258598361 +990,0.04474552091720452 +991,0.40241951588041347 +992,0.5873473825076658 +993,0.38212818142632543 +994,0.8770948644179681 +995,0.18210726703943658 +996,0.7879879363150989 +997,0.14870738186047538 +998,0.15312132054135852 +999,0.4747372545447177 From b842be67382020e649f4b117b2d986021d60ea3d Mon Sep 17 00:00:00 2001 From: Shesh Narayan Gupta <91396937+SheshNGupta@users.noreply.github.com> Date: Mon, 6 Jun 2022 18:47:19 -0400 Subject: [PATCH 
2/8] Delete Imputation_best_practices.ipynb --- notebooks/Imputation_best_practices.ipynb | 4557 --------------------- 1 file changed, 4557 deletions(-) delete mode 100644 notebooks/Imputation_best_practices.ipynb diff --git a/notebooks/Imputation_best_practices.ipynb b/notebooks/Imputation_best_practices.ipynb deleted file mode 100644 index 87d582d..0000000 --- a/notebooks/Imputation_best_practices.ipynb +++ /dev/null @@ -1,4557 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "e2ceaeb0-e282-4c63-97e2-f1dd03810aa2", - "metadata": {}, - "source": [ - "# What to try in this notebook?\n", - "\n", - "#### 1. Get a random number generated dataset from kaggle, use one column and create missing (1%, 5%, 10%), scale values, apply KNN, MEAN imputation. Compare the results and compute mean() and var() for the list of differences between org. and Imputed value \n", - "\n", - "Dataset - https://www.kaggle.com/timoboz/random-numbers\n", - "\n", - "#### 2. Use a housing dataset from UCI, use one column and create missing (1%, 5%, 10%), scale values, apply KNN, MEAN imputation. Compare the results and compute mean() and var() for the list of differences between org. 
and Imputed value \n", - "\n", - "Dataset - https://github.com/nikbearbrown/AI_Research_Group/blob/main/Awesome-UCI-Datasets/Classification/House_Price_predication/train.csv" - ] - }, - { - "cell_type": "code", - "execution_count": 101, - "id": "d8fe4103-6e71-4b97-810c-b599a0482944", - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import numpy as np\n", - "import warnings\n", - "warnings.filterwarnings('ignore')\n", - "from sklearn.impute import KNNImputer\n", - "from sklearn.preprocessing import MinMaxScaler" - ] - }, - { - "cell_type": "markdown", - "id": "f95427ef-d6bc-47b8-a516-45a05b238180", - "metadata": {}, - "source": [ - "# 1.1 Random Numbers dataset" - ] - }, - { - "cell_type": "code", - "execution_count": 102, - "id": "03fc0415-cdd2-415b-a273-08037b06afcf", - "metadata": {}, - "outputs": [], - "source": [ - "random_dataset = pd.read_csv('random_numbers_1000.csv')" - ] - }, - { - "cell_type": "code", - "execution_count": 103, - "id": "5ea97930-03cd-48ff-97b9-97e9cd9dde55", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
<table>\n", - "  <thead>\n", - "    <tr><th></th><th>Unnamed: 0</th><th>number</th></tr>\n", - "  </thead>\n", - "  <tbody>\n",
- "    <tr><th>782</th><td>782</td><td>0.955151</td></tr>\n",
- "    <tr><th>378</th><td>378</td><td>0.310217</td></tr>\n",
- "    <tr><th>542</th><td>542</td><td>0.607177</td></tr>\n",
- "    <tr><th>80</th><td>80</td><td>0.861696</td></tr>\n",
- "    <tr><th>282</th><td>282</td><td>0.204316</td></tr>\n",
- "    <tr><th>976</th><td>976</td><td>0.059688</td></tr>\n",
- "    <tr><th>924</th><td>924</td><td>0.372837</td></tr>\n",
- "    <tr><th>329</th><td>329</td><td>0.406915</td></tr>\n",
- "    <tr><th>131</th><td>131</td><td>0.402420</td></tr>\n",
- "    <tr><th>607</th><td>607</td><td>0.078909</td></tr>\n",
- "  </tbody>\n", - "</table>
" - ], - "text/plain": [ - " Unnamed: 0 number\n", - "782 782 0.955151\n", - "378 378 0.310217\n", - "542 542 0.607177\n", - "80 80 0.861696\n", - "282 282 0.204316\n", - "976 976 0.059688\n", - "924 924 0.372837\n", - "329 329 0.406915\n", - "131 131 0.402420\n", - "607 607 0.078909" - ] - }, - "execution_count": 103, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "random_dataset.sample(10)" - ] - }, - { - "cell_type": "code", - "execution_count": 104, - "id": "f19e199b-91aa-4e03-9e07-37f5a574d481", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 1000 entries, 0 to 999\n", - "Data columns (total 2 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 Unnamed: 0 1000 non-null int64 \n", - " 1 number 1000 non-null float64\n", - "dtypes: float64(1), int64(1)\n", - "memory usage: 15.8 KB\n" - ] - } - ], - "source": [ - "random_dataset.info()" - ] - }, - { - "cell_type": "code", - "execution_count": 105, - "id": "382f0f03-b3f4-4244-a95c-e78476fae2ca", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "count 1000.000000\n", - "mean 0.490463\n", - "std 0.284669\n", - "min 0.000068\n", - "25% 0.252124\n", - "50% 0.479825\n", - "75% 0.735584\n", - "max 0.997610\n", - "Name: number, dtype: float64" - ] - }, - "execution_count": 105, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "random_dataset['number'].describe()" - ] - }, - { - "cell_type": "markdown", - "id": "348a0b85-c450-4d5d-a9d2-c57c95964b42", - "metadata": {}, - "source": [ - "#### Create 3 col. for numbers for 1%, 5% and 10% missing data" - ] - }, - { - "cell_type": "code", - "execution_count": 106, - "id": "f5de26b3-17b7-463b-98e4-147a457ca37e", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
<table>\n", - "  <thead>\n", - "    <tr><th></th><th>number</th><th>number_copy_1_percent</th><th>number_copy_5_percent</th><th>number_copy_10_percent</th></tr>\n", - "  </thead>\n", - "  <tbody>\n",
- "    <tr><th>0</th><td>0.144616</td><td>0.144616</td><td>0.144616</td><td>0.144616</td></tr>\n",
- "    <tr><th>1</th><td>0.077515</td><td>0.077515</td><td>0.077515</td><td>0.077515</td></tr>\n",
- "    <tr><th>2</th><td>0.155933</td><td>0.155933</td><td>0.155933</td><td>0.155933</td></tr>\n",
- "    <tr><th>3</th><td>0.097209</td><td>0.097209</td><td>0.097209</td><td>0.097209</td></tr>\n",
- "    <tr><th>4</th><td>0.323750</td><td>0.323750</td><td>0.323750</td><td>0.323750</td></tr>\n",
- "    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
- "    <tr><th>995</th><td>0.182107</td><td>0.182107</td><td>0.182107</td><td>0.182107</td></tr>\n",
- "    <tr><th>996</th><td>0.787988</td><td>0.787988</td><td>0.787988</td><td>0.787988</td></tr>\n",
- "    <tr><th>997</th><td>0.148707</td><td>0.148707</td><td>0.148707</td><td>0.148707</td></tr>\n",
- "    <tr><th>998</th><td>0.153121</td><td>0.153121</td><td>0.153121</td><td>0.153121</td></tr>\n",
- "    <tr><th>999</th><td>0.474737</td><td>0.474737</td><td>0.474737</td><td>0.474737</td></tr>\n",
- "  </tbody>\n", - "</table>\n",
- "<p>1000 rows × 4 columns</p>
" - ], - "text/plain": [ - " number number_copy_1_percent number_copy_5_percent \\\n", - "0 0.144616 0.144616 0.144616 \n", - "1 0.077515 0.077515 0.077515 \n", - "2 0.155933 0.155933 0.155933 \n", - "3 0.097209 0.097209 0.097209 \n", - "4 0.323750 0.323750 0.323750 \n", - ".. ... ... ... \n", - "995 0.182107 0.182107 0.182107 \n", - "996 0.787988 0.787988 0.787988 \n", - "997 0.148707 0.148707 0.148707 \n", - "998 0.153121 0.153121 0.153121 \n", - "999 0.474737 0.474737 0.474737 \n", - "\n", - " number_copy_10_percent \n", - "0 0.144616 \n", - "1 0.077515 \n", - "2 0.155933 \n", - "3 0.097209 \n", - "4 0.323750 \n", - ".. ... \n", - "995 0.182107 \n", - "996 0.787988 \n", - "997 0.148707 \n", - "998 0.153121 \n", - "999 0.474737 \n", - "\n", - "[1000 rows x 4 columns]" - ] - }, - "execution_count": 106, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_number = random_dataset[['number']]\n", - "df_number['number_copy_1_percent'] = df_number[['number']]\n", - "df_number['number_copy_5_percent'] = df_number[['number']]\n", - "df_number['number_copy_10_percent'] = df_number[['number']]\n", - "df_number" - ] - }, - { - "cell_type": "markdown", - "id": "1ff95002-46a0-454b-97c1-6c189153d459", - "metadata": {}, - "source": [ - "#### Check % missing values in this dataframe" - ] - }, - { - "cell_type": "code", - "execution_count": 107, - "id": "35c38775-26d9-4b1e-97a9-4c46c0d5d92b", - "metadata": {}, - "outputs": [], - "source": [ - "def get_percent_missing(dataframe):\n", - " \n", - " percent_missing = dataframe.isnull().sum() * 100 / len(dataframe)\n", - " missing_value_df = pd.DataFrame({'column_name': dataframe.columns,\n", - " 'percent_missing': percent_missing})\n", - " return missing_value_df" - ] - }, - { - "cell_type": "code", - "execution_count": 108, - "id": "6837b7e5-4444-4914-9c0e-a9cefd2c7b6f", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - 
"number number 0.0\n", - "number_copy_1_percent number_copy_1_percent 0.0\n", - "number_copy_5_percent number_copy_5_percent 0.0\n", - "number_copy_10_percent number_copy_10_percent 0.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_number))" - ] - }, - { - "cell_type": "markdown", - "id": "25318ebf-b1bf-4f4b-ba1d-011b27a27f39", - "metadata": {}, - "source": [ - "#### Create missing helper fn" - ] - }, - { - "cell_type": "code", - "execution_count": 109, - "id": "76da9076-d9c8-417e-bcfc-8ce7066d1a53", - "metadata": {}, - "outputs": [], - "source": [ - "def create_missing(dataframe, percent, col):\n", - " dataframe.loc[dataframe.sample(frac = percent).index, col] = np.nan" - ] - }, - { - "cell_type": "markdown", - "id": "9dc43e57-be39-4efe-8131-d6a3423b8d77", - "metadata": {}, - "source": [ - "#### Create missing data in each col" - ] - }, - { - "cell_type": "code", - "execution_count": 110, - "id": "6e8ab693-6043-4ade-b62a-9b3fc9ebf735", - "metadata": {}, - "outputs": [], - "source": [ - "create_missing(df_number, 0.01, 'number_copy_1_percent')\n", - "create_missing(df_number, 0.05, 'number_copy_5_percent')\n", - "create_missing(df_number, 0.1, 'number_copy_10_percent')" - ] - }, - { - "cell_type": "markdown", - "id": "655cb92a-6b63-4498-9c31-d63f11145569", - "metadata": {}, - "source": [ - "#### Check % missing after removing data" - ] - }, - { - "cell_type": "code", - "execution_count": 111, - "id": "412518b5-67ec-4a5a-9720-4a0ce7657d44", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "number number 0.0\n", - "number_copy_1_percent number_copy_1_percent 1.0\n", - "number_copy_5_percent number_copy_5_percent 5.0\n", - "number_copy_10_percent number_copy_10_percent 10.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_number))" - ] - }, - { - "cell_type": "markdown", - "id": "6876e3fc-b878-4560-a3a4-72c36f2a422e", - "metadata": {}, - "source": [ - 
"#### Store the indices of missing rows" - ] - }, - { - "cell_type": "code", - "execution_count": 112, - "id": "c1860270-add6-4963-9aef-27ef1e171fca", - "metadata": {}, - "outputs": [], - "source": [ - "# Store Index of NaN values in each coloumns\n", - "number_1_idx = list(np.where(df_number['number_copy_1_percent'].isna())[0])\n", - "number_5_idx = list(np.where(df_number['number_copy_5_percent'].isna())[0])\n", - "number_10_idx = list(np.where(df_number['number_copy_10_percent'].isna())[0])" - ] - }, - { - "cell_type": "code", - "execution_count": 113, - "id": "57841da6-b453-40cc-8ecc-702fe4613a74", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Length of number_1_idx is 10 and it contains 1.0% of total data in column | Total rows: 1000\n", - "Length of number_5_idx is 50 and it contains 5.0% of total data in column | Total rows: 1000\n", - "Length of number_10_idx is 100 and it contains 10.0% of total data in column | Total rows: 1000\n" - ] - } - ], - "source": [ - "print(f\"Length of number_1_idx is {len(number_1_idx)} and it contains {(len(number_1_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")\n", - "print(f\"Length of number_5_idx is {len(number_5_idx)} and it contains {(len(number_5_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")\n", - "print(f\"Length of number_10_idx is {len(number_10_idx)} and it contains {(len(number_10_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")" - ] - }, - { - "cell_type": "markdown", - "id": "47469d0b-a8f3-4469-b18c-3a457f7dc373", - "metadata": {}, - "source": [ - "### Perform KNN impute to df_number dataframe" - ] - }, - { - "cell_type": "code", - "execution_count": 114, - "id": "b09c6c85-4ce3-4aeb-bb81-6a698494a58e", 
- "metadata": {}, - "outputs": [], - "source": [ - "df_number1 = df_number.copy(deep=True)\n", - "imputer = KNNImputer(n_neighbors=5)\n", - "imputed_number_df = pd.DataFrame(imputer.fit_transform(df_number1), columns = df_number1.columns)\n" - ] - }, - { - "cell_type": "code", - "execution_count": 115, - "id": "2f051a7d-3ebd-4839-aae0-ef125944d613", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " number number_copy_1_percent number_copy_5_percent \\\n", - "347 0.372389 0.372389 0.372389 \n", - "934 0.327766 0.327766 0.327766 \n", - "927 0.753892 0.753892 0.753892 \n", - "997 0.148707 0.148707 0.148707 \n", - "167 0.730901 0.730901 0.730901 \n", - "914 0.841330 0.841330 0.841330 \n", - "432 0.897466 0.897466 0.897466 \n", - "587 0.411685 0.411685 0.411685 \n", - "884 0.378794 0.378794 0.378794 \n", - "379 0.265429 0.265429 0.265429 \n", - "\n", - " number_copy_10_percent \n", - "347 0.372389 \n", - "934 0.327766 \n", - "927 0.753892 \n", - "997 0.148707 \n", - "167 0.730901 \n", - "914 0.841330 \n", - "432 0.897466 \n", - "587 0.411685 \n", - "884 0.378794 \n", - "379 0.264843 " - ] - }, - "execution_count": 115, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "imputed_number_df.sample(10)" - ] - }, - { - "cell_type": "markdown", - "id": "ddc79a45-bd2b-44f3-a3c4-aaefa73b43d9", - "metadata": {}, - "source": [ - "#### Check the % missing data in dataframe now" - ] - }, - { - "cell_type": "code", - "execution_count": 116, - "id": "5c98d450-bf5a-46e5-9091-c6a1202a2611", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "number number 0.0\n", - "number_copy_1_percent number_copy_1_percent 0.0\n", - "number_copy_5_percent number_copy_5_percent 0.0\n", - "number_copy_10_percent number_copy_10_percent 0.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(imputed_number_df))" - ] - }, - { - "cell_type": "markdown", - "id": "f14476bf-29e6-4d9a-9cd4-9dd56a53b466", - "metadata": {}, - "source": [ - "#### Store the list of differences between org. 
and Imputed value" - ] - }, - { - "cell_type": "code", - "execution_count": 117, - "id": "3f096800-dc6e-4455-a9e6-2db18884e5ee", - "metadata": {}, - "outputs": [], - "source": [ - "# Build lists of absolute differences between the imputed and original values\n", - "\n", - "number_diff_1 = []\n", - "number_diff_5 = []\n", - "number_diff_10 = []\n", - "\n", - "for i in number_1_idx:\n", - "    diff1 = abs(imputed_number_df['number_copy_1_percent'][i] - df_number1['number'][i])\n", - "    number_diff_1.append(diff1)\n", - "\n", - "for i in number_5_idx:\n", - "    diff5 = abs(imputed_number_df['number_copy_5_percent'][i] - df_number1['number'][i])\n", - "    number_diff_5.append(diff5)\n", - "\n", - "for i in number_10_idx:\n", - "    diff10 = abs(imputed_number_df['number_copy_10_percent'][i] - df_number1['number'][i])\n", - "    number_diff_10.append(diff10)" - ] - }, - { - "cell_type": "code", - "execution_count": 118, - "id": "4a2c29fc-99f3-4624-808e-437d3983cabb", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "10\n", - "50\n", - "100\n" - ] - } - ], - "source": [ - "print(len(number_diff_1))\n", - "print(len(number_diff_5))\n", - "print(len(number_diff_10))" - ] - }, - { - "cell_type": "markdown", - "id": "4ec4adbe-5571-40e3-90ba-92cb431161ca", - "metadata": {}, - "source": [ - "### Calculate the mean and variance of the list of differences - KNN" - ] - }, - { - "cell_type": "code", - "execution_count": 119, - "id": "1163cb62-9dc4-427e-b5cf-20bf3e16d79b", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The mean of 1% is 0.000790 and variance 1% is 4.568702e-07\n", - "The mean of 5% is 0.000676 and variance 5% is 3.072444e-07\n", - "The mean of 10% is 0.000648 and variance 10% is 2.480609e-07\n" - ] - } - ], - "source": [ - "m1 = sum(number_diff_1) / len(number_diff_1)\n", - "\n", - "# calculate the population 
variance with a generator expression\n", - "var_res1 = sum((xi - m1) ** 2 for xi in number_diff_1) / len(number_diff_1)\n", - "\n", - "m5 = sum(number_diff_5) / len(number_diff_5)\n", - "\n", - "# calculate the population variance with a generator expression\n", - "var_res5 = sum((xii - m5) ** 2 for xii in number_diff_5) / len(number_diff_5)\n", - "\n", - "\n", - "m10 = sum(number_diff_10) / len(number_diff_10)\n", - "\n", - "# calculate the population variance with a generator expression\n", - "var_res10 = sum((xiii - m10) ** 2 for xiii in number_diff_10) / len(number_diff_10)\n", - "\n", - "# report each missing-data level with its own mean (the 10% line previously printed m5)\n", - "print(f\"The mean of 1% is {m1:.6f} and variance 1% is {var_res1:.6e}\")\n", - "print(f\"The mean of 5% is {m5:.6f} and variance 5% is {var_res5:.6e}\")\n", - "print(f\"The mean of 10% is {m10:.6f} and variance 10% is {var_res10:.6e}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 120, - "id": "6987d059-7449-44a0-a3c2-8605362a18a0", - "metadata": {}, - "outputs": [], - "source": [ - "df_knn_number = pd.DataFrame.from_dict({'1%_number': [m1, var_res1],\n", - " '5%_number': [m5, var_res5],\n", - " '10%_number': [m10, var_res10]}, orient='index')\n", - "df_knn_number.columns=['diff. list Mean(KNN)', 'diff. 
list Var.(KNN)']" - ] - }, - { - "cell_type": "markdown", - "id": "41740e20-5dae-403e-a83b-94c91469fcc3", - "metadata": {}, - "source": [ - "### Perform MEAN based imputation" - ] - }, - { - "cell_type": "markdown", - "id": "17b69478-e97c-41b9-828a-eefbb46eb161", - "metadata": {}, - "source": [ - "#### Before mean imputation % missing" - ] - }, - { - "cell_type": "code", - "execution_count": 121, - "id": "5a828216-8f1a-4157-8141-77e6c929f57a", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "number number 0.0\n", - "number_copy_1_percent number_copy_1_percent 1.0\n", - "number_copy_5_percent number_copy_5_percent 5.0\n", - "number_copy_10_percent number_copy_10_percent 10.0\n" - ] - } - ], - "source": [ - "df_number2 = df_number.copy(deep=True)\n", - "print(get_percent_missing(df_number2))" - ] - }, - { - "cell_type": "code", - "execution_count": 122, - "id": "1e137676-9f01-44b9-8a84-50d03a89436b", - "metadata": {}, - "outputs": [], - "source": [ - "df_number2['number_copy_1_percent'] = df_number2['number_copy_1_percent'].fillna(df_number2['number_copy_1_percent'].mean())\n", - "df_number2['number_copy_5_percent'] = df_number2['number_copy_5_percent'].fillna(df_number2['number_copy_5_percent'].mean())\n", - "df_number2['number_copy_10_percent'] = df_number2['number_copy_10_percent'].fillna(df_number2['number_copy_10_percent'].mean())" - ] - }, - { - "cell_type": "markdown", - "id": "8da82021-d96a-46ac-81df-035977cb5497", - "metadata": {}, - "source": [ - "#### After mean impute % missing " - ] - }, - { - "cell_type": "code", - "execution_count": 123, - "id": "669c14bd-f920-47db-8476-1cd1b4f4f5bb", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "number number 0.0\n", - "number_copy_1_percent number_copy_1_percent 0.0\n", - "number_copy_5_percent number_copy_5_percent 0.0\n", - 
"number_copy_10_percent number_copy_10_percent 0.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_number2))" - ] - }, - { - "cell_type": "code", - "execution_count": 124, - "id": "ccb60d18-b24e-4211-9947-46ee0bcc06fe", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " number number_copy_1_percent number_copy_5_percent \\\n", - "366 0.425525 0.425525 0.425525 \n", - "145 0.246589 0.246589 0.246589 \n", - "538 0.503701 0.503701 0.503701 \n", - "256 0.118901 0.118901 0.491932 \n", - "156 0.773215 0.773215 0.773215 \n", - "500 0.441087 0.441087 0.441087 \n", - "325 0.095068 0.095068 0.095068 \n", - "97 0.209842 0.209842 0.209842 \n", - "905 0.117657 0.491084 0.117657 \n", - "251 0.961305 0.961305 0.961305 \n", - "\n", - " number_copy_10_percent \n", - "366 0.425525 \n", - "145 0.246589 \n", - "538 0.503701 \n", - "256 0.118901 \n", - "156 0.773215 \n", - "500 0.441087 \n", - "325 0.095068 \n", - "97 0.487348 \n", - "905 0.117657 \n", - "251 0.961305 " - ] - }, - "execution_count": 124, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_number2.sample(10)" - ] - }, - { - "cell_type": "markdown", - "id": "88d89795-0ae9-4f37-89cd-b24d36658588", - "metadata": {}, - "source": [ - "#### Create a list of differences - MEAN" - ] - }, - { - "cell_type": "code", - "execution_count": 125, - "id": "530979d5-52c4-473d-95f3-754c460a7ab6", - "metadata": {}, - "outputs": [], - "source": [ - "# Build lists of absolute differences between the imputed and original values\n", - "\n", - "number_diff_1_mean = []\n", - "number_diff_5_mean = []\n", - "number_diff_10_mean = []\n", - "\n", - "for i in number_1_idx:\n", - "    diff1 = abs(df_number2['number_copy_1_percent'][i] - df_number2['number'][i])\n", - "    number_diff_1_mean.append(diff1)\n", - "\n", - "for i in number_5_idx:\n", - "    diff5 = abs(df_number2['number_copy_5_percent'][i] - df_number2['number'][i])\n", - "    number_diff_5_mean.append(diff5)\n", - "\n", - "for i in number_10_idx:\n", - "    diff10 = abs(df_number2['number_copy_10_percent'][i] - df_number2['number'][i])\n", - "    number_diff_10_mean.append(diff10)" - ] - }, - { - "cell_type": "code", - "execution_count": 126, - "id": 
"28dd2494-0175-431e-b4b7-09ee4af1f6a0", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "10\n", - "50\n", - "100\n" - ] - } - ], - "source": [ - "print(len(number_diff_1_mean))\n", - "print(len(number_diff_5_mean))\n", - "print(len(number_diff_10_mean))" - ] - }, - { - "cell_type": "markdown", - "id": "4e90251e-4c0a-4e2d-82b1-8764374aed1c", - "metadata": {}, - "source": [ - "### Calculate the mean and variance of the list of differences - MEAN Impute" - ] - }, - { - "cell_type": "code", - "execution_count": 127, - "id": "682bd76e-4875-4b4d-b90b-91d8a6e492ae", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The mean of 1% is 0.269369 and variance 1% is 0.018130\n", - "The mean of 5% is 0.184841 and variance 5% is 0.014921\n", - "The mean of 10% is 0.231501 and variance 10% is 0.020024\n" - ] - } - ], - "source": [ - "m1 = sum(number_diff_1_mean) / len(number_diff_1_mean)\n", - "\n", - "# calculate the population variance with a generator expression\n", - "var_res1 = sum((xi - m1) ** 2 for xi in number_diff_1_mean) / len(number_diff_1_mean)\n", - "\n", - "m5 = sum(number_diff_5_mean) / len(number_diff_5_mean)\n", - "\n", - "# calculate the population variance with a generator expression\n", - "var_res5 = sum((xii - m5) ** 2 for xii in number_diff_5_mean) / len(number_diff_5_mean)\n", - "\n", - "\n", - "m10 = sum(number_diff_10_mean) / len(number_diff_10_mean)\n", - "\n", - "# calculate the population variance with a generator expression\n", - "var_res10 = sum((xiii - m10) ** 2 for xiii in number_diff_10_mean) / len(number_diff_10_mean)\n", - "\n", - "# report each missing-data level with its own mean (the 10% line previously printed m5)\n", - "print(f\"The mean of 1% is {m1:.6f} and variance 1% is {var_res1:.6f}\")\n", - "print(f\"The mean of 5% is {m5:.6f} and variance 5% is {var_res5:.6f}\")\n", - "print(f\"The mean of 10% is {m10:.6f} and variance 10% is {var_res10:.6f}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 128, - "id": 
"1f41880d-3e7d-48c9-8744-7e47ccae3c17", - "metadata": {}, - "outputs": [], - "source": [ - "df_MI_number = pd.DataFrame.from_dict({'1%_number': [m1, var_res1],\n", - " '5%_number': [m5, var_res5],\n", - " '10%_number': [m10, var_res10]}, orient='index')\n", - "df_MI_number.columns=['diff. list Mean(MI)', 'diff. list Var.(MI)']" - ] - }, - { - "cell_type": "markdown", - "id": "ec64b079-db97-429c-ae3a-519eec91db3f", - "metadata": {}, - "source": [ - "## KNN and MEAN columns side by side" - ] - }, - { - "cell_type": "code", - "execution_count": 129, - "id": "d74b0e73-e3f0-4107-806d-c5d5a50aab9a", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import display_html\n", - "from itertools import chain,cycle\n", - "def display_side_by_side(*args,titles=cycle([''])):\n", - " html_str=''\n", - " for df,title in zip(args, chain(titles,cycle(['
</br>'])) ):\n", - "     html_str+='<th style=\"text-align:center\"><td style=\"vertical-align:top\">'\n", - "     html_str+=f'<h2>{title}</h2>'\n", - "     html_str+=df.to_html().replace('table','table style=\"display:inline\"')\n", - "     html_str+='</td></th>'\n", - " display_html(html_str,raw=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 130, - "id": "747a487f-cbc4-467a-9bc7-b0856dbb6576", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 130, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from IPython.display import display, HTML\n", - "\n", - "CSS = \"\"\"\n", - ".output {\n", - " flex-direction: row;\n", - "}\n", - "\"\"\"\n", - "\n", - "HTML('<style>{}</style>'.format(CSS))" - ] - }, - { - "cell_type": "code", - "execution_count": 131, - "id": "d24551d1-cd58-4a41-8262-873fe5034272", - "metadata": {}, - "outputs": [], - "source": [ - "# https://github.com/epmoyer/ipy_table/issues/24\n", - "\n", - "from IPython.display import HTML\n", - "\n", - "def multi_table(table_list):\n", - "    ''' Accepts a list of IpyTable objects and returns a table which contains each IpyTable in a cell\n", - "    '''\n", - "    return HTML(\n", - "        '<table><tr style=\"background-color:white;\">' + \n", - "        ''.join(['<td>' + table._repr_html_() + '</td>' for table in table_list]) +\n", - "        '</tr></table>
'\n", - " )" - ] - }, - { - "cell_type": "code", - "execution_count": 132, - "id": "8a8daa30-3abf-4315-ae58-f9171ff000d5", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[124, 257, 309, 313, 405]\n" - ] - } - ], - "source": [ - "print(number_1_idx[:5])" - ] - }, - { - "cell_type": "code", - "execution_count": 133, - "id": "da6b1646-2417-42b7-bc8f-d3b0be85c61b", - "metadata": {}, - "outputs": [], - "source": [ - "compare_1 = imputed_number_df.loc[:, [\"number\", \"number_copy_1_percent\"]]\n", - "compare_5 = imputed_number_df.loc[:, [\"number\", \"number_copy_5_percent\"]]\n", - "compare_10 = imputed_number_df.loc[:, [\"number\", \"number_copy_10_percent\"]]" - ] - }, - { - "cell_type": "code", - "execution_count": 134, - "id": "380b94cf-264f-4a41-bb1d-ac272354073f", - "metadata": {}, - "outputs": [], - "source": [ - "compare_1_df = compare_1.iloc[number_1_idx]\n", - "compare_5_df = compare_5.iloc[number_5_idx]\n", - "compare_10_df = compare_10.iloc[number_10_idx]" - ] - }, - { - "cell_type": "code", - "execution_count": 135, - "id": "e5b21e71-0ddd-4c60-b931-b384d65230dd", - "metadata": {}, - "outputs": [], - "source": [ - "compare_1_mean = df_number2.loc[:, [\"number\", \"number_copy_1_percent\"]]\n", - "compare_5_mean = df_number2.loc[:, [\"number\", \"number_copy_5_percent\"]]\n", - "compare_10_mean = df_number2.loc[:, [\"number\", \"number_copy_10_percent\"]]" - ] - }, - { - "cell_type": "code", - "execution_count": 136, - "id": "29be3554-8129-4f0c-bad6-1270b7c6c05b", - "metadata": {}, - "outputs": [], - "source": [ - "compare_1_mean_df = compare_1_mean.iloc[number_1_idx]\n", - "compare_5_mean_df = compare_5_mean.iloc[number_5_idx]\n", - "compare_10_mean_df = compare_10_mean.iloc[number_10_idx]" - ] - }, - { - "cell_type": "code", - "execution_count": 137, - "id": "27b96ecc-3566-48f5-bec5-9b073c575cb6", - "metadata": {}, - "outputs": [], - "source": [ - "# display_side_by_side(compare_1_df.head(), 
compare_1_mean_df.head(), titles=['number 1% KNN Impute','number 1% Mean Impute'])\n", - "# display_side_by_side(compare_5_df.head(), compare_5_mean_df.head(), titles=['number 5% KNN Impute','number 5% Mean Impute'])\n", - "# display_side_by_side(compare_10_df.head(), compare_10_mean_df.head(), titles=['number 10% KNN Impute','number 10% Mean Impute'])" - ] - }, - { - "cell_type": "markdown", - "id": "72a3bc3c-0f91-49ad-bf03-dc4b7ace265d", - "metadata": {}, - "source": [ - "#### **number 1% KNN Impute VS number 1% Mean Impute**" - ] - }, - { - "cell_type": "code", - "execution_count": 138, - "id": "6fd11f89-9f4b-49b3-b114-1ab3b461f180", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
KNN impute:\n", - " number number_copy_1_percent\n", - "124 0.192990 0.192926\n", - "257 0.065602 0.066172\n", - "309 0.661447 0.663769\n", - "313 0.963951 0.962988\n", - "405 0.627460 0.627545\n", - "\n", - "Mean impute:\n", - " number number_copy_1_percent\n", - "124 0.192990 0.491084\n", - "257 0.065602 0.491084\n", - "309 0.661447 0.491084\n", - "313 0.963951 0.491084\n", - "405 0.627460 0.491084\n", - "
" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 138, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "multi_table([compare_1_df.head(), compare_1_mean_df.head()])" - ] - }, - { - "cell_type": "markdown", - "id": "e1fc9d1c-53ef-42d3-809b-d68051057e48", - "metadata": {}, - "source": [ - "#### **number 5% KNN Impute VS number 5% Mean Impute**" - ] - }, - { - "cell_type": "code", - "execution_count": 139, - "id": "a97c1530-2e50-48d2-a7e0-89fc70f648e5", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
KNN impute:\n", - " number number_copy_5_percent\n", - "54 0.440144 0.439307\n", - "59 0.189655 0.191045\n", - "72 0.411451 0.412386\n", - "78 0.205178 0.204306\n", - "107 0.323097 0.322044\n", - "\n", - "Mean impute:\n", - " number number_copy_5_percent\n", - "54 0.440144 0.491932\n", - "59 0.189655 0.491932\n", - "72 0.411451 0.491932\n", - "78 0.205178 0.491932\n", - "107 0.323097 0.491932\n", - "
" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 139, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "multi_table([compare_5_df.head(), compare_5_mean_df.head()])" - ] - }, - { - "cell_type": "markdown", - "id": "1e732ac9-faf7-4457-baef-ac9c4976598c", - "metadata": {}, - "source": [ - "#### **number 10% KNN Impute VS number 10% Mean Impute**" - ] - }, - { - "cell_type": "code", - "execution_count": 140, - "id": "f2d22e8f-5a0b-48c0-9150-a391d48e93b2", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
KNN impute:\n", - " number number_copy_10_percent\n", - "22 0.798188 0.798777\n", - "47 0.861454 0.861385\n", - "49 0.445108 0.446055\n", - "68 0.557468 0.557299\n", - "69 0.231172 0.230069\n", - "\n", - "Mean impute:\n", - " number number_copy_10_percent\n", - "22 0.798188 0.487348\n", - "47 0.861454 0.487348\n", - "49 0.445108 0.487348\n", - "68 0.557468 0.487348\n", - "69 0.231172 0.487348\n", - "
" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 140, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "multi_table([compare_10_df.head(), compare_10_mean_df.head()])" - ] - }, - { - "cell_type": "markdown", - "id": "cc817314-971f-4abf-a56e-9830a5cf0329", - "metadata": {}, - "source": [ - "# 1.2 Random Numbers dataset Results - KNN and MEAN" - ] - }, - { - "cell_type": "code", - "execution_count": 142, - "id": "1397844d-6757-471c-bd76-ff84d466b150", - "metadata": {}, - "outputs": [], - "source": [ - "results = pd.concat([df_knn_number, df_MI_number])" - ] - }, - { - "cell_type": "code", - "execution_count": 143, - "id": "51868cc7-20f3-499d-a76d-f06f99ea1841", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " diff. list Mean(KNN) diff. list Var.(KNN) diff. list Mean(MI) \\\n", - "1%_number 0.000790 4.568702e-07 NaN \n", - "5%_number 0.000676 3.072444e-07 NaN \n", - "10%_number 0.000648 2.480609e-07 NaN \n", - "1%_number NaN NaN 0.269369 \n", - "5%_number NaN NaN 0.184841 \n", - "10%_number NaN NaN 0.231501 \n", - "\n", - " diff. list Var.(MI) \n", - "1%_number NaN \n", - "5%_number NaN \n", - "10%_number NaN \n", - "1%_number 0.018130 \n", - "5%_number 0.014921 \n", - "10%_number 0.020024 " - ] - }, - "execution_count": 143, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "results" - ] - }, - { - "cell_type": "code", - "execution_count": 144, - "id": "85deaebb-3a2b-4b52-bf80-ce31499a70d8", - "metadata": {}, - "outputs": [], - "source": [ - "results.to_csv('random_num_knn_mean_results.csv')" - ] - }, - { - "cell_type": "markdown", - "id": "08586561-e3a5-4d15-a1c0-b8d71731a84a", - "metadata": {}, - "source": [ - "# 2.1 Housing Dataset " - ] - }, - { - "cell_type": "code", - "execution_count": 361, - "id": "c05f4dd5-4cdc-4617-939a-2e22ec859af1", - "metadata": {}, - "outputs": [], - "source": [ - "housing_data = pd.read_csv('https://raw.githubusercontent.com/nikbearbrown/AI_Research_Group/main/Awesome-UCI-Datasets/Classification/House_Price_predication/train.csv')" - ] - }, - { - "cell_type": "code", - "execution_count": 362, - "id": "8564d163-97ce-44da-8d3c-6f8cd9c1d0a1", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", - "820 821 60 RL 72.0 7226 Pave NaN IR1 \n", - "1390 1391 20 RL 70.0 9100 Pave NaN Reg \n", - "535 536 190 RL 70.0 7000 Pave NaN Reg \n", - "1236 1237 160 RL 36.0 2628 Pave NaN Reg \n", - "1337 1338 30 RM 153.0 4118 Pave Grvl IR1 \n", - "674 675 20 RL 80.0 9200 Pave NaN Reg \n", - "604 605 20 RL 88.0 12803 Pave NaN IR1 \n", - "605 606 60 RL 85.0 13600 Pave NaN Reg \n", - "1218 1219 50 RM 52.0 6240 Pave NaN Reg \n", - "882 883 60 RL NaN 9636 Pave NaN IR1 \n", - "\n", - " LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal \\\n", - "820 Lvl AllPub ... 0 NaN NaN NaN 0 \n", - "1390 Lvl AllPub ... 0 NaN NaN NaN 0 \n", - "535 Lvl AllPub ... 0 NaN NaN NaN 0 \n", - "1236 Lvl AllPub ... 0 NaN NaN NaN 0 \n", - "1337 Bnk AllPub ... 0 NaN NaN NaN 0 \n", - "674 Lvl AllPub ... 0 NaN NaN NaN 0 \n", - "604 Lvl AllPub ... 0 NaN NaN NaN 0 \n", - "605 Lvl AllPub ... 0 NaN NaN NaN 0 \n", - "1218 Lvl AllPub ... 0 NaN NaN NaN 0 \n", - "882 Lvl AllPub ... 
0 NaN MnPrv NaN 0 \n", - "\n", - " MoSold YrSold SaleType SaleCondition SalePrice \n", - "820 6 2008 WD Normal 183000 \n", - "1390 9 2006 WD Normal 235000 \n", - "535 1 2008 WD Normal 107500 \n", - "1236 6 2010 WD Normal 175500 \n", - "1337 3 2006 WD Normal 52500 \n", - "674 7 2008 WD Normal 140000 \n", - "604 9 2008 WD Normal 221000 \n", - "605 10 2009 WD Normal 205000 \n", - "1218 7 2006 WD Normal 80500 \n", - "882 12 2009 WD Normal 178000 \n", - "\n", - "[10 rows x 81 columns]" - ] - }, - "execution_count": 362, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "housing_data.sample(10)" - ] - }, - { - "cell_type": "code", - "execution_count": 363, - "id": "bd81975c-0a21-414b-8e20-3564d35b9f9b", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "663" - ] - }, - "execution_count": 363, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "housing_data['SalePrice'].nunique()" - ] - }, - { - "cell_type": "code", - "execution_count": 364, - "id": "67d1046e-a1ad-412e-a7e8-a0d51729cec7", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "1073" - ] - }, - "execution_count": 364, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "housing_data['LotArea'].nunique()" - ] - }, - { - "cell_type": "code", - "execution_count": 365, - "id": "64b05e52-72dc-4f7d-aca3-d043036b4d2f", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "count 1460.000000\n", - "mean 180921.195890\n", - "std 79442.502883\n", - "min 34900.000000\n", - "25% 129975.000000\n", - "50% 163000.000000\n", - "75% 214000.000000\n", - "max 755000.000000\n", - "Name: SalePrice, dtype: float64" - ] - }, - "execution_count": 365, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "housing_data['SalePrice'].describe()" - ] - }, - { - "cell_type": "code", - "execution_count": 366, - "id": "b7e9928c-4785-4ee1-8150-cd0fa1ef3325", - "metadata": {}, - "outputs": [ 
- { - "data": { - "text/plain": [ - "count 1460.000000\n", - "mean 10516.828082\n", - "std 9981.264932\n", - "min 1300.000000\n", - "25% 7553.500000\n", - "50% 9478.500000\n", - "75% 11601.500000\n", - "max 215245.000000\n", - "Name: LotArea, dtype: float64" - ] - }, - "execution_count": 366, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "housing_data['LotArea'].describe()" - ] - }, - { - "cell_type": "code", - "execution_count": 367, - "id": "20149f80-07dc-4eaa-8d0e-7de6612a7dce", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "Id Id 0.000000\n", - "MSSubClass MSSubClass 0.000000\n", - "MSZoning MSZoning 0.000000\n", - "LotFrontage LotFrontage 17.739726\n", - "LotArea LotArea 0.000000\n", - "Street Street 0.000000\n", - "Alley Alley 93.767123\n", - "LotShape LotShape 0.000000\n", - "LandContour LandContour 0.000000\n", - "Utilities Utilities 0.000000\n", - "LotConfig LotConfig 0.000000\n", - "LandSlope LandSlope 0.000000\n", - "Neighborhood Neighborhood 0.000000\n", - "Condition1 Condition1 0.000000\n", - "Condition2 Condition2 0.000000\n", - "BldgType BldgType 0.000000\n", - "HouseStyle HouseStyle 0.000000\n", - "OverallQual OverallQual 0.000000\n", - "OverallCond OverallCond 0.000000\n", - "YearBuilt YearBuilt 0.000000\n", - "YearRemodAdd YearRemodAdd 0.000000\n", - "RoofStyle RoofStyle 0.000000\n", - "RoofMatl RoofMatl 0.000000\n", - "Exterior1st Exterior1st 0.000000\n", - "Exterior2nd Exterior2nd 0.000000\n", - "MasVnrType MasVnrType 0.547945\n", - "MasVnrArea MasVnrArea 0.547945\n", - "ExterQual ExterQual 0.000000\n", - "ExterCond ExterCond 0.000000\n", - "Foundation Foundation 0.000000\n", - "BsmtQual BsmtQual 2.534247\n", - "BsmtCond BsmtCond 2.534247\n", - "BsmtExposure BsmtExposure 2.602740\n", - "BsmtFinType1 BsmtFinType1 2.534247\n", - "BsmtFinSF1 BsmtFinSF1 0.000000\n", - "BsmtFinType2 BsmtFinType2 2.602740\n", - "BsmtFinSF2 BsmtFinSF2 
0.000000\n", - "BsmtUnfSF BsmtUnfSF 0.000000\n", - "TotalBsmtSF TotalBsmtSF 0.000000\n", - "Heating Heating 0.000000\n", - "HeatingQC HeatingQC 0.000000\n", - "CentralAir CentralAir 0.000000\n", - "Electrical Electrical 0.068493\n", - "1stFlrSF 1stFlrSF 0.000000\n", - "2ndFlrSF 2ndFlrSF 0.000000\n", - "LowQualFinSF LowQualFinSF 0.000000\n", - "GrLivArea GrLivArea 0.000000\n", - "BsmtFullBath BsmtFullBath 0.000000\n", - "BsmtHalfBath BsmtHalfBath 0.000000\n", - "FullBath FullBath 0.000000\n", - "HalfBath HalfBath 0.000000\n", - "BedroomAbvGr BedroomAbvGr 0.000000\n", - "KitchenAbvGr KitchenAbvGr 0.000000\n", - "KitchenQual KitchenQual 0.000000\n", - "TotRmsAbvGrd TotRmsAbvGrd 0.000000\n", - "Functional Functional 0.000000\n", - "Fireplaces Fireplaces 0.000000\n", - "FireplaceQu FireplaceQu 47.260274\n", - "GarageType GarageType 5.547945\n", - "GarageYrBlt GarageYrBlt 5.547945\n", - "GarageFinish GarageFinish 5.547945\n", - "GarageCars GarageCars 0.000000\n", - "GarageArea GarageArea 0.000000\n", - "GarageQual GarageQual 5.547945\n", - "GarageCond GarageCond 5.547945\n", - "PavedDrive PavedDrive 0.000000\n", - "WoodDeckSF WoodDeckSF 0.000000\n", - "OpenPorchSF OpenPorchSF 0.000000\n", - "EnclosedPorch EnclosedPorch 0.000000\n", - "3SsnPorch 3SsnPorch 0.000000\n", - "ScreenPorch ScreenPorch 0.000000\n", - "PoolArea PoolArea 0.000000\n", - "PoolQC PoolQC 99.520548\n", - "Fence Fence 80.753425\n", - "MiscFeature MiscFeature 96.301370\n", - "MiscVal MiscVal 0.000000\n", - "MoSold MoSold 0.000000\n", - "YrSold YrSold 0.000000\n", - "SaleType SaleType 0.000000\n", - "SaleCondition SaleCondition 0.000000\n", - "SalePrice SalePrice 0.000000\n" - ] - } - ], - "source": [ - "pd.set_option('display.max_rows', None)\n", - "print(get_percent_missing(housing_data))" - ] - }, - { - "cell_type": "markdown", - "id": "c8eb3ee3-085d-4b41-9a5f-c83a3805f870", - "metadata": {}, - "source": [ - "#### Using the SalePrice column for the KNN and MEAN imputation task" - ] - }, - { - "cell_type": 
"markdown", - "id": "451c79fb-17ba-40ac-8f0b-87a8b2ec4837", - "metadata": {}, - "source": [ - "#### Non Scaled dataframe Sale Price - take first 1000 rows" - ] - }, - { - "cell_type": "code", - "execution_count": 368, - "id": "9cc1f97f-1b24-4570-8f6a-30426bd79269", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", - "0 208500 208500 208500 208500\n", - "1 181500 181500 181500 181500\n", - "2 223500 223500 223500 223500\n", - "3 140000 140000 140000 140000\n", - "4 250000 250000 250000 250000" - ] - }, - "execution_count": 368, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_saleprice = housing_data[['SalePrice']][:1000]\n", - "df_saleprice['sp_copy_1_percent'] = df_saleprice[['SalePrice']]\n", - "df_saleprice['sp_copy_5_percent'] = df_saleprice[['SalePrice']]\n", - "df_saleprice['sp_copy_10_percent'] = df_saleprice[['SalePrice']]\n", - "df_saleprice.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 369, - "id": "f462f065-9f37-44f1-a22e-92e610dae2e9", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "1000" - ] - }, - "execution_count": 369, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "len(df_saleprice)" - ] - }, - { - "cell_type": "markdown", - "id": "03407bbd-f8a7-4f6c-a7c3-64a865ed3f7e", - "metadata": {}, - "source": [ - "#### Scaled Dataframe SalePrice - take first 1000 rows" - ] - }, - { - "cell_type": "code", - "execution_count": 370, - "id": "e461b1ef-df2c-410f-aea8-abe954fa9afd", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", - "0 0.241078 0.241078 0.241078 0.241078\n", - "1 0.203583 0.203583 0.203583 0.203583\n", - "2 0.261908 0.261908 0.261908 0.261908\n", - "3 0.145952 0.145952 0.145952 0.145952\n", - "4 0.298709 0.298709 0.298709 0.298709" - ] - }, - "execution_count": 370, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "scaler = MinMaxScaler()\n", - "df_saleprice_scaled = df_saleprice.copy(deep=True)\n", - "df_saleprice_scaled = pd.DataFrame(scaler.fit_transform(df_saleprice_scaled), columns = df_saleprice_scaled.columns)\n", - "df_saleprice_scaled.head()" - ] - }, - { - "cell_type": "markdown", - "id": "a66683c4-f66a-4aa1-ab8a-f28087b60b6c", - "metadata": {}, - "source": [ - "#### Check % missing values in this dataframe" - ] - }, - { - "cell_type": "code", - "execution_count": 371, - "id": "0075fa0f-4b82-4089-ab81-e5282497c4a3", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 0.0\n", - "sp_copy_5_percent sp_copy_5_percent 0.0\n", - "sp_copy_10_percent sp_copy_10_percent 0.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_saleprice))" - ] - }, - { - "cell_type": "markdown", - "id": "619ef99f-55c0-422c-aaa8-73cd71fcf2fb", - "metadata": {}, - "source": [ - "#### Create 1%, 5% and 10% missing data" - ] - }, - { - "cell_type": "code", - "execution_count": 372, - "id": "82df5098-4176-4fba-922f-ca84c0466f2a", - "metadata": {}, - "outputs": [], - "source": [ - "create_missing(df_saleprice, 0.01, 'sp_copy_1_percent')\n", - "create_missing(df_saleprice, 0.05, 'sp_copy_5_percent')\n", - "create_missing(df_saleprice, 0.1, 'sp_copy_10_percent')" - ] - }, - { - "cell_type": "code", - "execution_count": 373, - "id": "0e90ae04-cd10-4507-a851-c187010f0be0", - "metadata": {}, - "outputs": [], - "source": [ 
- "create_missing(df_saleprice_scaled, 0.01, 'sp_copy_1_percent')\n", - "create_missing(df_saleprice_scaled, 0.05, 'sp_copy_5_percent')\n", - "create_missing(df_saleprice_scaled, 0.1, 'sp_copy_10_percent')" - ] - }, - { - "cell_type": "markdown", - "id": "a8237a82-5a33-4ce9-b4c7-a48ede4f5fef", - "metadata": {}, - "source": [ - "#### With/Without scaling dataframe missing values check" - ] - }, - { - "cell_type": "code", - "execution_count": 374, - "id": "2794306d-89c7-4518-8979-9edb3d9441b1", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 1.0\n", - "sp_copy_5_percent sp_copy_5_percent 5.0\n", - "sp_copy_10_percent sp_copy_10_percent 10.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_saleprice))" - ] - }, - { - "cell_type": "code", - "execution_count": 375, - "id": "8351dbe2-b388-451d-9238-52c4ccabd425", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 1.0\n", - "sp_copy_5_percent sp_copy_5_percent 5.0\n", - "sp_copy_10_percent sp_copy_10_percent 10.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_saleprice_scaled))" - ] - }, - { - "cell_type": "code", - "execution_count": 376, - "id": "b11b093f-110b-4ef3-9d00-ac4fed45a956", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "10" - ] - }, - "execution_count": 376, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_saleprice['sp_copy_1_percent'].isna().sum()" - ] - }, - { - "cell_type": "markdown", - "id": "360e0010-e085-435c-8902-80c6a7ea78be", - "metadata": {}, - "source": [ - "#### Store indices of missing values" - ] - }, - { - "cell_type": "code", - "execution_count": 377, - "id": "e546096c-ce35-448e-aa97-0943d3535a87", - 
"metadata": {}, - "outputs": [], - "source": [ - "# Store the positional indices of the NaN values in each column\n", - "sp_1_idx = list(np.where(df_saleprice['sp_copy_1_percent'].isna())[0])\n", - "sp_5_idx = list(np.where(df_saleprice['sp_copy_5_percent'].isna())[0])\n", - "sp_10_idx = list(np.where(df_saleprice['sp_copy_10_percent'].isna())[0])" - ] - }, - { - "cell_type": "code", - "execution_count": 378, - "id": "d409e2a5-b3a9-4ae1-9b17-88b7c642692d", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "10\n", - "50\n", - "100\n" - ] - } - ], - "source": [ - "print(len(sp_1_idx))\n", - "print(len(sp_5_idx))\n", - "print(len(sp_10_idx))" - ] - }, - { - "cell_type": "code", - "execution_count": 379, - "id": "5839460a-e736-42e9-9a13-d5bab5683115", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Length of sp_1_idx is 10 and it contains 1.0% of total data in column | Total rows: 1000\n", - "Length of sp_5_idx is 50 and it contains 5.0% of total data in column | Total rows: 1000\n", - "Length of sp_10_idx is 100 and it contains 10.0% of total data in column | Total rows: 1000\n" - ] - } - ], - "source": [ - "print(f\"Length of sp_1_idx is {len(sp_1_idx)} and it contains {(len(sp_1_idx)/len(df_saleprice['sp_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_1_percent'])}\")\n", - "print(f\"Length of sp_5_idx is {len(sp_5_idx)} and it contains {(len(sp_5_idx)/len(df_saleprice['sp_copy_5_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_1_percent'])}\")\n", - "print(f\"Length of sp_10_idx is {len(sp_10_idx)} and it contains {(len(sp_10_idx)/len(df_saleprice['sp_copy_10_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_1_percent'])}\")" - ] - }, - { - "cell_type": "markdown", - "id": "c1464c79-c0a9-4640-92dd-f0d5131634ab", - "metadata": {}, - "source": [ - "### Perform KNN imputation on df_saleprice 
and df_saleprice_scaled dataframe" - ] - }, - { - "cell_type": "code", - "execution_count": 380, - "id": "08fa2436-ffb8-4b5d-a7a1-9e2d63b14562", - "metadata": {}, - "outputs": [], - "source": [ - "df_saleprice1 = df_saleprice.copy(deep=True)\n", - "imputer = KNNImputer(n_neighbors=5)\n", - "imputed_saleprice_df = pd.DataFrame(imputer.fit_transform(df_saleprice1), columns = df_saleprice1.columns)" - ] - }, - { - "cell_type": "code", - "execution_count": 381, - "id": "205c7a96-3f1c-42a4-91de-f22f15ce9cb2", - "metadata": {}, - "outputs": [], - "source": [ - "df_saleprice_scaled1 = df_saleprice_scaled.copy(deep=True)\n", - "imputer = KNNImputer(n_neighbors=5)\n", - "imputed_saleprice_scaled_df = pd.DataFrame(imputer.fit_transform(df_saleprice_scaled1), columns = df_saleprice_scaled1.columns)" - ] - }, - { - "cell_type": "code", - "execution_count": 382, - "id": "a482f58d-73b6-423c-b97a-140884830a0f", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", - "0 208500.0 208500.0 208500.0 208500.0\n", - "1 181500.0 181500.0 181500.0 181500.0\n", - "2 223500.0 223500.0 223500.0 223500.0\n", - "3 140000.0 140000.0 140000.0 140000.0\n", - "4 250000.0 250000.0 250000.0 250000.0" - ] - }, - "execution_count": 382, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "imputed_saleprice_df.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 383, - "id": "11f8f5ff-f06d-4ec2-a4e3-1324e807a537", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", - "0 0.241078 0.241078 0.240855 0.241078\n", - "1 0.203583 0.203583 0.203583 0.203583\n", - "2 0.261908 0.261908 0.261908 0.261908\n", - "3 0.145952 0.145952 0.145952 0.145952\n", - "4 0.298709 0.298709 0.298709 0.298709" - ] - }, - "execution_count": 383, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "imputed_saleprice_scaled_df.head()" - ] - }, - { - "cell_type": "markdown", - "id": "d9fd7fa1-4ce0-43be-9955-55ef759d930b", - "metadata": {}, - "source": [ - "#### Check % missing in saleprice and saleprice_scaled DF" - ] - }, - { - "cell_type": "code", - "execution_count": 384, - "id": "9ed0d36a-9584-4e3b-9201-2ac36827bce9", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 0.0\n", - "sp_copy_5_percent sp_copy_5_percent 0.0\n", - "sp_copy_10_percent sp_copy_10_percent 0.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(imputed_saleprice_df))" - ] - }, - { - "cell_type": "code", - "execution_count": 385, - "id": "7c842fce-bbd5-4c2c-bb1a-db5df92f6315", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 0.0\n", - "sp_copy_5_percent sp_copy_5_percent 0.0\n", - "sp_copy_10_percent sp_copy_10_percent 0.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(imputed_saleprice_scaled_df))" - ] - }, - { - "cell_type": "markdown", - "id": "ac47abb1-df5f-4686-bc67-6617140c008c", - "metadata": {}, - "source": [ - "#### Store the list of differences between Org. 
and Imputed Value" - ] - }, - { - "cell_type": "code", - "execution_count": 386, - "id": "99e04554-568d-4efa-a110-768b50dfaee6", - "metadata": {}, - "outputs": [], - "source": [ - "# Absolute differences between imputed and original values (unscaled frame)\n", - "\n", - "sp_diff_1 = []\n", - "sp_diff_5 = []\n", - "sp_diff_10 = []\n", - "count = 0\n", - "\n", - "for i in sp_1_idx:\n", - " count +=1\n", - " diff1 = abs(imputed_saleprice_df['sp_copy_1_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", - " sp_diff_1.append(diff1)\n", - " \n", - "\n", - "for i in sp_5_idx:\n", - " diff5 = abs(imputed_saleprice_df['sp_copy_5_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", - " sp_diff_5.append(diff5)\n", - "\n", - "for i in sp_10_idx:\n", - " diff10 = abs(imputed_saleprice_df['sp_copy_10_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", - " sp_diff_10.append(diff10)" - ] - }, - { - "cell_type": "code", - "execution_count": 387, - "id": "92204f8a-497c-470d-a770-59165d226cc9", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "10\n", - "50\n", - "100\n" - ] - } - ], - "source": [ - "print(len(sp_diff_1))\n", - "print(len(sp_diff_5))\n", - "print(len(sp_diff_10))" - ] - }, - { - "cell_type": "code", - "execution_count": 388, - "id": "b8875fff-0289-4dd9-92c1-78dc9b730d22", - "metadata": {}, - "outputs": [], - "source": [ - "# Absolute differences between imputed and original values (scaled frame)\n", - "# Note: sp_*_idx were taken from df_saleprice; they only apply here if\n", - "# create_missing blanked the same rows in df_saleprice_scaled.\n", - "\n", - "sp_scaled_diff_1 = []\n", - "sp_scaled_diff_5 = []\n", - "sp_scaled_diff_10 = []\n", - "count = 0\n", - "\n", - "for i in sp_1_idx:\n", - " count +=1\n", - " diff1 = abs(imputed_saleprice_scaled_df['sp_copy_1_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", - " sp_scaled_diff_1.append(diff1)\n", - " \n", - "\n", - "for i in sp_5_idx:\n", - " diff5 = abs(imputed_saleprice_scaled_df['sp_copy_5_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", - " sp_scaled_diff_5.append(diff5)\n", - "\n", - "for i 
in sp_10_idx:\n", - " diff10 = abs(imputed_saleprice_scaled_df['sp_copy_10_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", - " sp_scaled_diff_10.append(diff10)" - ] - }, - { - "cell_type": "code", - "execution_count": 389, - "id": "40192344-79a4-444c-a12a-2201dc5aa0c1", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "10\n", - "50\n", - "100\n" - ] - } - ], - "source": [ - "print(len(sp_scaled_diff_1))\n", - "print(len(sp_scaled_diff_5))\n", - "print(len(sp_scaled_diff_10))" - ] - }, - { - "cell_type": "code", - "execution_count": 390, - "id": "a95bd45c-8a2f-4159-8306-399ec18a4c0f", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[0.0, 0.0, 0.0, 0.0, 0.0]" - ] - }, - "execution_count": 390, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "sp_scaled_diff_1[:5]" - ] - }, - { - "cell_type": "code", - "execution_count": 391, - "id": "0f73d420-8842-4062-ae17-158a0a25e169", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[10.0, 20.0, 80.0, 220.0, 0.0]" - ] - }, - "execution_count": 391, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "sp_diff_1[:5]" - ] - }, - { - "cell_type": "markdown", - "id": "a40fd400-913b-4011-b0b9-dd3ca0d5827a", - "metadata": {}, - "source": [ - "#### Calculate the mean and var of list of diff. 
KNN - SalePrice" - ] - }, - { - "cell_type": "code", - "execution_count": 392, - "id": "80267827-7f73-49ff-b200-27cdb2963756", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The mean of 1% is 170.0 and variance 1% is 42400.0\n", - "The mean of 5% is 444.9439999999997 and variance 5% is 2554554.1584639903\n", - "The mean of 10% is 564.784 and variance 10% is 6304766.8341439795\n" - ] - } - ], - "source": [ - "m1 = sum(sp_diff_1) / len(sp_diff_1)\n", - "\n", - "# population variance of the differences (generator expression)\n", - "var_res1 = sum((xi - m1) ** 2 for xi in sp_diff_1) / len(sp_diff_1)\n", - "\n", - "m5 = sum(sp_diff_5) / len(sp_diff_5)\n", - "\n", - "# population variance of the differences (generator expression)\n", - "var_res5 = sum((xii - m5) ** 2 for xii in sp_diff_5) / len(sp_diff_5)\n", - "\n", - "\n", - "m10 = sum(sp_diff_10) / len(sp_diff_10)\n", - "\n", - "# population variance of the differences (generator expression)\n", - "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_diff_10) / len(sp_diff_10)\n", - "\n", - "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", - "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", - "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 393, - "id": "358545ff-2fcf-4c99-9049-4eaf6dd110bd", - "metadata": {}, - "outputs": [], - "source": [ - "df_knn_saleprice = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", - " '5%_saleprice': [m5, var_res5],\n", - " '10%_saleprice': [m10, var_res10]}, orient='index')\n", - "df_knn_saleprice.columns=['diff. list Mean(KNN)', 'diff. list Var.(KNN)']" - ] - }, - { - "cell_type": "code", - "execution_count": 394, - "id": "3714c8f9-58db-40a7-b5a2-6bb7e788b734", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " diff. list Mean(KNN) diff. list Var.(KNN)\n", - "1%_saleprice 170.000 4.240000e+04\n", - "5%_saleprice 444.944 2.554554e+06\n", - "10%_saleprice 564.784 6.304767e+06" - ] - }, - "execution_count": 394, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_knn_saleprice" - ] - }, - { - "cell_type": "markdown", - "id": "fd7608a8-c5fb-425c-a340-af01801ee349", - "metadata": {}, - "source": [ - "#### Calculate the mean and var of list of diff. KNN - SalePrice scaled" - ] - }, - { - "cell_type": "code", - "execution_count": 395, - "id": "bb03017f-3d91-48d9-8ebf-7cb5c25fadc3", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The mean of 1% is 0.0 and variance 1% is 0.0\n", - "The mean of 5% is 2.6301902513541363e-05 and variance 5% is 2.134349753649814e-08\n", - "The mean of 10% is 3.2e-05 and variance 10% is 1.417383473391258e-08\n" - ] - } - ], - "source": [ - "m1 = sum(sp_scaled_diff_1) / len(sp_scaled_diff_1)\n", - "\n", - "# population variance of the differences (generator expression)\n", - "var_res1 = sum((xi - m1) ** 2 for xi in sp_scaled_diff_1) / len(sp_scaled_diff_1)\n", - "\n", - "m5 = sum(sp_scaled_diff_5) / len(sp_scaled_diff_5)\n", - "\n", - "# population variance of the differences (generator expression)\n", - "var_res5 = sum((xii - m5) ** 2 for xii in sp_scaled_diff_5) / len(sp_scaled_diff_5)\n", - "\n", - "\n", - "m10 = sum(sp_scaled_diff_10) / len(sp_scaled_diff_10)\n", - "\n", - "# population variance of the differences (generator expression)\n", - "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_scaled_diff_10) / len(sp_scaled_diff_10)\n", - "\n", - "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", - "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", - "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 396, - "id": "290d8db2-c9f4-4028-ab44-ad68c9e7b3c5", - 
"metadata": {}, - "outputs": [], - "source": [ - "df_knn_saleprice_scaled = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", - " '5%_saleprice': [m5, var_res5],\n", - " '10%_saleprice': [m10, var_res10]}, orient='index')\n", - "df_knn_saleprice_scaled.columns=['diff. list Mean(KNN) scaled', 'diff. list Var.(KNN) scaled']" - ] - }, - { - "cell_type": "code", - "execution_count": 397, - "id": "89347fd7-d87d-42bb-b375-a75417c395de", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " diff. list Mean(KNN) scaled diff. list Var.(KNN) scaled\n", - "1%_saleprice 0.000000 0.000000e+00\n", - "5%_saleprice 0.000026 2.134350e-08\n", - "10%_saleprice 0.000032 1.417383e-08" - ] - }, - "execution_count": 397, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_knn_saleprice_scaled" - ] - }, - { - "cell_type": "markdown", - "id": "c984dc69-f85f-4f1b-8c94-4afb48c1c8db", - "metadata": {}, - "source": [ - "### Perform MEAN imputation" - ] - }, - { - "cell_type": "code", - "execution_count": 398, - "id": "008bc14f-45e7-42d8-b843-2fee7bcf26c2", - "metadata": {}, - "outputs": [], - "source": [ - "df_saleprice2 = df_saleprice.copy(deep=True)\n", - "df_saleprice_scaled2 = df_saleprice_scaled.copy(deep=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 399, - "id": "bd71dc1a-f137-46ed-bf2b-f3d87fd4b6a0", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 1.0\n", - "sp_copy_5_percent sp_copy_5_percent 5.0\n", - "sp_copy_10_percent sp_copy_10_percent 10.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_saleprice2))" - ] - }, - { - "cell_type": "code", - "execution_count": 400, - "id": "46237cfd-6361-466f-b66f-32f5940149d6", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 1.0\n", - "sp_copy_5_percent sp_copy_5_percent 5.0\n", - "sp_copy_10_percent sp_copy_10_percent 10.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_saleprice_scaled2))" - ] - }, - { - "cell_type": "markdown", - "id": "64465299-5620-47b9-a28d-afb5494f279e", - "metadata": {}, - "source": [ - "#### Impute Mean values in missing for saleprice and saleprice_scaled" - ] - }, - { - "cell_type": "code", - 
"execution_count": 401, - "id": "28cf6b75-eebf-4758-94ec-4b3536f2c659", - "metadata": {}, - "outputs": [], - "source": [ - "df_saleprice2['sp_copy_1_percent'] = df_saleprice2['sp_copy_1_percent'].fillna(df_saleprice2['sp_copy_1_percent'].mean())\n", - "df_saleprice2['sp_copy_5_percent'] = df_saleprice2['sp_copy_5_percent'].fillna(df_saleprice2['sp_copy_5_percent'].mean())\n", - "df_saleprice2['sp_copy_10_percent'] = df_saleprice2['sp_copy_10_percent'].fillna(df_saleprice2['sp_copy_10_percent'].mean())" - ] - }, - { - "cell_type": "code", - "execution_count": 402, - "id": "2409dd8c-3cd0-4742-b0ac-14dea1fdb504", - "metadata": {}, - "outputs": [], - "source": [ - "df_saleprice_scaled2['sp_copy_1_percent'] = df_saleprice_scaled2['sp_copy_1_percent'].fillna(df_saleprice_scaled2['sp_copy_1_percent'].mean())\n", - "df_saleprice_scaled2['sp_copy_5_percent'] = df_saleprice_scaled2['sp_copy_5_percent'].fillna(df_saleprice_scaled2['sp_copy_5_percent'].mean())\n", - "df_saleprice_scaled2['sp_copy_10_percent'] = df_saleprice_scaled2['sp_copy_10_percent'].fillna(df_saleprice_scaled2['sp_copy_10_percent'].mean())" - ] - }, - { - "cell_type": "markdown", - "id": "62377754-b682-45e5-8faa-1a4a186bd3c7", - "metadata": {}, - "source": [ - "#### After MEAN imputation - Saleprice and saleprice scaled" - ] - }, - { - "cell_type": "code", - "execution_count": 403, - "id": "6c448556-55f4-4685-aed2-6b67d5ad8a2a", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 0.0\n", - "sp_copy_5_percent sp_copy_5_percent 0.0\n", - "sp_copy_10_percent sp_copy_10_percent 0.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_saleprice2))" - ] - }, - { - "cell_type": "code", - "execution_count": 404, - "id": "d9775fbf-7a72-4352-b446-488e9d25b6a2", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " 
column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 0.0\n", - "sp_copy_5_percent sp_copy_5_percent 0.0\n", - "sp_copy_10_percent sp_copy_10_percent 0.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_saleprice_scaled2))" - ] - }, - { - "cell_type": "code", - "execution_count": 407, - "id": "136f87e6-a4af-4229-b36a-695f712deee5", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", - "571 120000 120000.0 120000.000000 182343.817778\n", - "2 223500 223500.0 223500.000000 223500.000000\n", - "313 375000 375000.0 375000.000000 375000.000000\n", - "377 340000 340000.0 182457.342105 182343.817778\n", - "987 395192 395192.0 395192.000000 395192.000000" - ] - }, - "execution_count": 407, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_saleprice2.sample(5)" - ] - }, - { - "cell_type": "code", - "execution_count": 409, - "id": "784cb61c-78f8-4b31-b709-379c50024dca", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
     SalePrice  sp_copy_1_percent  sp_copy_5_percent  sp_copy_10_percent
216   0.243161           0.243161           0.243161            0.243161
1     0.203583           0.203583           0.203583            0.203583
575   0.116095           0.116095           0.116095            0.116095
397   0.186918           0.186918           0.186918            0.205253
703   0.145952           0.145952           0.145952            0.145952
\n", - "
" - ], - "text/plain": [ - " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", - "216 0.243161 0.243161 0.243161 0.243161\n", - "1 0.203583 0.203583 0.203583 0.203583\n", - "575 0.116095 0.116095 0.116095 0.116095\n", - "397 0.186918 0.186918 0.186918 0.205253\n", - "703 0.145952 0.145952 0.145952 0.145952" - ] - }, - "execution_count": 409, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_saleprice_scaled2.sample(5)" - ] - }, - { - "cell_type": "markdown", - "id": "33c1f3b7-5afc-45cb-8b43-9682ec87156d", - "metadata": {}, - "source": [ - "#### Create List of differences for saleprice and saleprice_scaled Dataframes" - ] - }, - { - "cell_type": "code", - "execution_count": 410, - "id": "d2faf410-f83e-4ccb-89d4-e6f8c7adffbb", - "metadata": {}, - "outputs": [], - "source": [ - "# create lists of absolute differences between imputed and original values\n", - "\n", - "sp_mean_diff_1 = []\n", - "sp_mean_diff_5 = []\n", - "sp_mean_diff_10 = []\n", - "count = 0\n", - "\n", - "for i in sp_1_idx:\n", - " count +=1\n", - " diff1 = abs(df_saleprice2['sp_copy_1_percent'][i] - df_saleprice2['SalePrice'][i])\n", - " sp_mean_diff_1.append(diff1)\n", - " \n", - "\n", - "for i in sp_5_idx:\n", - " diff5 = abs(df_saleprice2['sp_copy_5_percent'][i] - df_saleprice2['SalePrice'][i])\n", - " sp_mean_diff_5.append(diff5)\n", - "\n", - "for i in sp_10_idx:\n", - " diff10 = abs(df_saleprice2['sp_copy_10_percent'][i] - df_saleprice2['SalePrice'][i])\n", - " sp_mean_diff_10.append(diff10)" - ] - }, - { - "cell_type": "code", - "execution_count": 411, - "id": "789b07c5-530a-4111-8c97-f5297f7da5e4", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "10\n", - "50\n", - "100\n" - ] - } - ], - "source": [ - "print(len(sp_mean_diff_1))\n", - "print(len(sp_mean_diff_5))\n", - "print(len(sp_mean_diff_10))" - ] - }, - { - "cell_type": "code", - "execution_count": 412, - "id": "4fec222c-2420-41af-9e2a-d9773e1d6259", - 
"metadata": {}, - "outputs": [], - "source": [ - "# create lists of absolute differences between imputed and original values\n", - "\n", - "sp_scaled_mean_diff_1 = []\n", - "sp_scaled_mean_diff_5 = []\n", - "sp_scaled_mean_diff_10 = []\n", - "count = 0\n", - "\n", - "for i in sp_1_idx:\n", - " count +=1\n", - " diff1 = abs(df_saleprice_scaled2['sp_copy_1_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", - " sp_scaled_mean_diff_1.append(diff1)\n", - " \n", - "\n", - "for i in sp_5_idx:\n", - " diff5 = abs(df_saleprice_scaled2['sp_copy_5_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", - " sp_scaled_mean_diff_5.append(diff5)\n", - "\n", - "for i in sp_10_idx:\n", - " diff10 = abs(df_saleprice_scaled2['sp_copy_10_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", - " sp_scaled_mean_diff_10.append(diff10)" - ] - }, - { - "cell_type": "code", - "execution_count": 413, - "id": "de9bf1de-68fe-4894-915a-7069b386123f", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "10\n", - "50\n", - "100\n" - ] - } - ], - "source": [ - "print(len(sp_scaled_mean_diff_1))\n", - "print(len(sp_scaled_mean_diff_5))\n", - "print(len(sp_scaled_mean_diff_10))" - ] - }, - { - "cell_type": "markdown", - "id": "f7b93757-d1a7-41a1-85fa-3ee77734be5b", - "metadata": {}, - "source": [ - "#### Calculate mean and var of list of diff. 
- MEAN impute SalePrice" - ] - }, - { - "cell_type": "code", - "execution_count": 414, - "id": "c60d3aad-33f0-48f4-8bb0-f8af45e33e1e", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The mean of 1% is 55971.63676767676 and variance 1% is 1103367192.190047\n", - "The mean of 5% is 58478.24210526314 and variance 5% is 3139731297.2794733\n", - "The mean of 10% is 61028.709911 and variance 10% is 3846674638.263318\n" - ] - } - ], - "source": [ - "m1 = sum(sp_mean_diff_1) / len(sp_mean_diff_1)\n", - "\n", - "# calculate population variance using a generator expression\n", - "var_res1 = sum((xi - m1) ** 2 for xi in sp_mean_diff_1) / len(sp_mean_diff_1)\n", - "\n", - "m5 = sum(sp_mean_diff_5) / len(sp_mean_diff_5)\n", - "\n", - "# calculate population variance using a generator expression\n", - "var_res5 = sum((xii - m5) ** 2 for xii in sp_mean_diff_5) / len(sp_mean_diff_5)\n", - "\n", - "\n", - "m10 = sum(sp_mean_diff_10) / len(sp_mean_diff_10)\n", - "\n", - "# calculate population variance using a generator expression\n", - "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_mean_diff_10) / len(sp_mean_diff_10)\n", - "\n", - "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", - "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", - "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 415, - "id": "e7f6e5cf-4eaa-4bfe-add2-fc7f600941b7", - "metadata": {}, - "outputs": [], - "source": [ - "df_mean_saleprice = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", - " '5%_saleprice': [m5, var_res5],\n", - " '10%_saleprice': [m10, var_res10]}, orient='index')\n", - "df_mean_saleprice.columns=['diff. list Mean(MI)', 'diff. list Var.(MI)']" - ] - }, - { - "cell_type": "code", - "execution_count": 416, - "id": "cc37eeaf-e3cd-4a83-870d-fab7037eeffe", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
               diff. list Mean(MI)  diff. list Var.(MI)
1%_saleprice          55971.636768         1.103367e+09
5%_saleprice          58478.242105         3.139731e+09
10%_saleprice         61028.709911         3.846675e+09
\n", - "
" - ], - "text/plain": [ - " diff. list Mean(MI) diff. list Var.(MI)\n", - "1%_saleprice 55971.636768 1.103367e+09\n", - "5%_saleprice 58478.242105 3.139731e+09\n", - "10%_saleprice 61028.709911 3.846675e+09" - ] - }, - "execution_count": 416, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_mean_saleprice" - ] - }, - { - "cell_type": "markdown", - "id": "f405f073-1b45-47e8-873b-7a9d34ad0e5c", - "metadata": {}, - "source": [ - "#### Calculate mean and var of list of diff. - MEAN impute SalePrice scaled" - ] - }, - { - "cell_type": "code", - "execution_count": 417, - "id": "2516b4f7-6b79-4636-9bd5-0738343ea355", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The mean of 1% is 0.0 and variance 1% is 0.0\n", - "The mean of 5% is 0.00893610697344667 and variance 5% is 0.0014044730755095036\n", - "The mean of 10% is 0.007492 and variance 10% is 0.0004431848362889144\n" - ] - } - ], - "source": [ - "m1 = sum(sp_scaled_mean_diff_1) / len(sp_scaled_mean_diff_1)\n", - "\n", - "# calculate population variance using a generator expression\n", - "var_res1 = sum((xi - m1) ** 2 for xi in sp_scaled_mean_diff_1) / len(sp_scaled_mean_diff_1)\n", - "\n", - "m5 = sum(sp_scaled_mean_diff_5) / len(sp_scaled_mean_diff_5)\n", - "\n", - "# calculate population variance using a generator expression\n", - "var_res5 = sum((xii - m5) ** 2 for xii in sp_scaled_mean_diff_5) / len(sp_scaled_mean_diff_5)\n", - "\n", - "\n", - "m10 = sum(sp_scaled_mean_diff_10) / len(sp_scaled_mean_diff_10)\n", - "\n", - "# calculate population variance using a generator expression\n", - "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_scaled_mean_diff_10) / len(sp_scaled_mean_diff_10)\n", - "\n", - "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", - "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", - "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" - ] - }, - { - "cell_type": "code", - 
"execution_count": 418, - "id": "fe6a93b8-d6cb-4d7d-856b-ab4ee8fe78fc", - "metadata": {}, - "outputs": [], - "source": [ - "df_mean_saleprice_scaled = pd.DataFrame.from_dict({'1%_saleprice_scaled': [m1, var_res1],\n", - " '5%_saleprice_scaled': [m5, var_res5],\n", - " '10%_saleprice_scaled': [m10, var_res10]}, orient='index')\n", - "df_mean_saleprice_scaled.columns=['diff. list Mean(MI) scaled', 'diff. list Var.(MI) scaled']" - ] - }, - { - "cell_type": "code", - "execution_count": 419, - "id": "e74c35ed-7c2d-44ab-b6c2-4d81c2c6b6bb", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
                      diff. list Mean(MI) scaled  diff. list Var.(MI) scaled
1%_saleprice_scaled                     0.000000                    0.000000
5%_saleprice_scaled                     0.008936                    0.001404
10%_saleprice_scaled                    0.007492                    0.000443
\n", - "
" - ], - "text/plain": [ - " diff. list Mean(MI) scaled diff. list Var.(MI) scaled\n", - "1%_saleprice_scaled 0.000000 0.000000\n", - "5%_saleprice_scaled 0.008936 0.001404\n", - "10%_saleprice_scaled 0.007492 0.000443" - ] - }, - "execution_count": 419, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_mean_saleprice_scaled" - ] - }, - { - "cell_type": "markdown", - "id": "876b979a-f5c4-43a7-9ead-d5d866bef078", - "metadata": {}, - "source": [ - "# 2.2 Housing Data Results - KNN and MEAN" - ] - }, - { - "cell_type": "code", - "execution_count": 420, - "id": "fea4b521-03a3-46ce-b217-27225eb868af", - "metadata": {}, - "outputs": [], - "source": [ - "results1 = pd.concat([df_knn_saleprice, df_knn_saleprice_scaled, df_mean_saleprice, df_mean_saleprice_scaled])" - ] - }, - { - "cell_type": "code", - "execution_count": 421, - "id": "631729d6-e853-4ba5-b5fd-4e632ec00d5f", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
                      diff. list Mean(KNN)  diff. list Var.(KNN)  diff. list Mean(KNN) scaled  diff. list Var.(KNN) scaled  diff. list Mean(MI)  diff. list Var.(MI)  diff. list Mean(MI) scaled  diff. list Var.(MI) scaled
1%_saleprice                       170.000          4.240000e+04                          NaN                          NaN                  NaN                  NaN                         NaN                         NaN
5%_saleprice                       444.944          2.554554e+06                          NaN                          NaN                  NaN                  NaN                         NaN                         NaN
10%_saleprice                      564.784          6.304767e+06                          NaN                          NaN                  NaN                  NaN                         NaN                         NaN
1%_saleprice                           NaN                   NaN                     0.000000                 0.000000e+00                  NaN                  NaN                         NaN                         NaN
5%_saleprice                           NaN                   NaN                     0.000026                 2.134350e-08                  NaN                  NaN                         NaN                         NaN
10%_saleprice                          NaN                   NaN                     0.000032                 1.417383e-08                  NaN                  NaN                         NaN                         NaN
1%_saleprice                           NaN                   NaN                          NaN                          NaN         55971.636768         1.103367e+09                         NaN                         NaN
5%_saleprice                           NaN                   NaN                          NaN                          NaN         58478.242105         3.139731e+09                         NaN                         NaN
10%_saleprice                          NaN                   NaN                          NaN                          NaN         61028.709911         3.846675e+09                         NaN                         NaN
1%_saleprice_scaled                    NaN                   NaN                          NaN                          NaN                  NaN                  NaN                    0.000000                    0.000000
5%_saleprice_scaled                    NaN                   NaN                          NaN                          NaN                  NaN                  NaN                    0.008936                    0.001404
10%_saleprice_scaled                   NaN                   NaN                          NaN                          NaN                  NaN                  NaN                    0.007492                    0.000443
\n", - "
" - ], - "text/plain": [ - " diff. list Mean(KNN) diff. list Var.(KNN) \\\n", - "1%_saleprice 170.000 4.240000e+04 \n", - "5%_saleprice 444.944 2.554554e+06 \n", - "10%_saleprice 564.784 6.304767e+06 \n", - "1%_saleprice NaN NaN \n", - "5%_saleprice NaN NaN \n", - "10%_saleprice NaN NaN \n", - "1%_saleprice NaN NaN \n", - "5%_saleprice NaN NaN \n", - "10%_saleprice NaN NaN \n", - "1%_saleprice_scaled NaN NaN \n", - "5%_saleprice_scaled NaN NaN \n", - "10%_saleprice_scaled NaN NaN \n", - "\n", - " diff. list Mean(KNN) scaled \\\n", - "1%_saleprice NaN \n", - "5%_saleprice NaN \n", - "10%_saleprice NaN \n", - "1%_saleprice 0.000000 \n", - "5%_saleprice 0.000026 \n", - "10%_saleprice 0.000032 \n", - "1%_saleprice NaN \n", - "5%_saleprice NaN \n", - "10%_saleprice NaN \n", - "1%_saleprice_scaled NaN \n", - "5%_saleprice_scaled NaN \n", - "10%_saleprice_scaled NaN \n", - "\n", - " diff. list Var.(KNN) scaled diff. list Mean(MI) \\\n", - "1%_saleprice NaN NaN \n", - "5%_saleprice NaN NaN \n", - "10%_saleprice NaN NaN \n", - "1%_saleprice 0.000000e+00 NaN \n", - "5%_saleprice 2.134350e-08 NaN \n", - "10%_saleprice 1.417383e-08 NaN \n", - "1%_saleprice NaN 55971.636768 \n", - "5%_saleprice NaN 58478.242105 \n", - "10%_saleprice NaN 61028.709911 \n", - "1%_saleprice_scaled NaN NaN \n", - "5%_saleprice_scaled NaN NaN \n", - "10%_saleprice_scaled NaN NaN \n", - "\n", - " diff. list Var.(MI) diff. list Mean(MI) scaled \\\n", - "1%_saleprice NaN NaN \n", - "5%_saleprice NaN NaN \n", - "10%_saleprice NaN NaN \n", - "1%_saleprice NaN NaN \n", - "5%_saleprice NaN NaN \n", - "10%_saleprice NaN NaN \n", - "1%_saleprice 1.103367e+09 NaN \n", - "5%_saleprice 3.139731e+09 NaN \n", - "10%_saleprice 3.846675e+09 NaN \n", - "1%_saleprice_scaled NaN 0.000000 \n", - "5%_saleprice_scaled NaN 0.008936 \n", - "10%_saleprice_scaled NaN 0.007492 \n", - "\n", - " diff. 
list Var.(MI) scaled \n", - "1%_saleprice NaN \n", - "5%_saleprice NaN \n", - "10%_saleprice NaN \n", - "1%_saleprice NaN \n", - "5%_saleprice NaN \n", - "10%_saleprice NaN \n", - "1%_saleprice NaN \n", - "5%_saleprice NaN \n", - "10%_saleprice NaN \n", - "1%_saleprice_scaled 0.000000 \n", - "5%_saleprice_scaled 0.001404 \n", - "10%_saleprice_scaled 0.000443 " - ] - }, - "execution_count": 421, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "results1" - ] - }, - { - "cell_type": "code", - "execution_count": 422, - "id": "a255c5bc-c062-4029-8f18-0c7644ca1d7c", - "metadata": {}, - "outputs": [], - "source": [ - "results1.to_csv('housing_data_saleprice_KNN_Mean_results.csv')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c9b0060e-129c-465e-a2a5-c3113ac4b936", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "pytorch_kz_env", - "language": "python", - "name": "pytorch_kz_env" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.11" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} From db26a047fc3323e127084b14d092b7538e1370a2 Mon Sep 17 00:00:00 2001 From: Shesh Narayan Gupta <91396937+SheshNGupta@users.noreply.github.com> Date: Mon, 6 Jun 2022 18:47:35 -0400 Subject: [PATCH 3/8] Delete random_numbers_1000.csv --- notebooks/random_numbers_1000.csv | 1001 ----------------------------- 1 file changed, 1001 deletions(-) delete mode 100644 notebooks/random_numbers_1000.csv diff --git a/notebooks/random_numbers_1000.csv b/notebooks/random_numbers_1000.csv deleted file mode 100644 index b988bad..0000000 --- a/notebooks/random_numbers_1000.csv +++ /dev/null @@ -1,1001 +0,0 @@ -,number -0,0.14461602473455892 -1,0.07751503129173953 -2,0.15593297226701996 
-3,0.09720879582042008 -4,0.32375017402684214 -5,0.686823745565341 -6,0.7068035159437503 -7,0.9167216890721541 -8,0.6352048775376901 -9,0.17132904054220055 -10,0.8159661332230377 -11,0.16475992352396795 -12,0.0409370627667629 -13,0.16726651783050572 -14,0.9709841404608549 -15,0.7314646963631376 -16,0.3426860074270154 -17,0.03452867763070577 -18,0.3574832521777054 -19,0.5745017180628896 -20,0.9464018964648249 -21,0.17346442317598176 -22,0.7981877585797893 -23,0.7809787573425518 -24,0.5238193208352585 -25,0.7821735568253659 -26,0.9934482007890996 -27,0.4184423331593896 -28,0.2599014381523176 -29,0.79832254805514 -30,0.6041862665264831 -31,0.3819864440431342 -32,0.8521701748665009 -33,0.3126469510739037 -34,0.573165703657289 -35,0.6265563684951247 -36,0.739416657331853 -37,0.012060677103418738 -38,0.9526287180476393 -39,0.3919187115227588 -40,0.2638910529614693 -41,0.28055121530104343 -42,0.5573435702875359 -43,0.810470016341365 -44,0.5595615325523974 -45,0.408760756112558 -46,0.8630495060594643 -47,0.8614542990838314 -48,0.8236790421079785 -49,0.445107982060686 -50,0.9240480240430241 -51,0.17212099430841699 -52,0.2821871607285322 -53,0.37501938886942654 -54,0.4401439635045862 -55,0.1316322082815632 -56,0.06144638522796442 -57,0.9719025725097523 -58,0.6437628611013991 -59,0.18965508288943556 -60,0.06647339880458658 -61,0.9432875072199843 -62,0.9635593500723799 -63,0.8159138106628153 -64,0.5268141359426226 -65,0.8097577290919002 -66,0.10832871122562193 -67,0.513926863373751 -68,0.5574679474011387 -69,0.23117155673924017 -70,0.7988683863124257 -71,0.14232155967666804 -72,0.4114506075996932 -73,0.028703811806714996 -74,0.15511224785648736 -75,0.5179635133770123 -76,0.6343922699321491 -77,0.5442703351502044 -78,0.2051777299642784 -79,0.9514959457303863 -80,0.8616963431169906 -81,0.9260797192939593 -82,0.6837050092238902 -83,0.6341651538285088 -84,0.47009701258761005 -85,0.6290009641921982 -86,0.9976095248457479 -87,0.6766165875739423 -88,0.34775785853790764 
-89,0.24721164403263118 -90,0.7644613432516099 -91,0.8578411267105046 -92,0.02847593788616165 -93,0.7352417508308864 -94,0.6439666934556955 -95,0.4145386388213331 -96,0.9000774058908544 -97,0.20984159212668807 -98,0.5736834527493817 -99,0.5731814122745401 -100,0.39175113064248857 -101,0.9414042225202869 -102,0.35865018640717594 -103,0.34942147114579614 -104,0.6287322577319368 -105,0.5640558939154473 -106,0.9935619072485498 -107,0.3230972874260011 -108,0.30050448033239197 -109,0.8535359869169682 -110,0.8186071691655027 -111,0.8507126794809163 -112,0.11848293702439716 -113,0.34039997170201786 -114,0.24848934681272938 -115,0.8713564278618446 -116,0.7192981378269337 -117,0.5612771185476495 -118,0.3001718489057721 -119,0.5582566234063182 -120,0.20715922789136187 -121,0.24718349962906172 -122,0.9096809353144786 -123,0.9496126251594162 -124,0.19298962232482253 -125,0.6823143045816399 -126,0.2950869303839806 -127,0.700872866143569 -128,0.9246255564110638 -129,0.3918411220739513 -130,0.5046695500081352 -131,0.40242035593564884 -132,0.5348070625842399 -133,0.6190144238291141 -134,0.6527067332418969 -135,0.7798534811708006 -136,0.8371435153002993 -137,0.7256654504898371 -138,0.19486710733751433 -139,0.17061227388763445 -140,0.3866266766943538 -141,0.9861342050546121 -142,0.12499832976125236 -143,0.4076100289319884 -144,0.24405060656519562 -145,0.24658924623282708 -146,0.31303910086742404 -147,0.13582549628998997 -148,0.4267352707490074 -149,0.6860815270131422 -150,0.2632104445655937 -151,0.7095899448677616 -152,0.30697391312148903 -153,0.15020764760355143 -154,0.33008237434926957 -155,0.24730791798017127 -156,0.7732146302465086 -157,0.3986960975344779 -158,0.878302550945857 -159,0.3073561016445441 -160,0.21123045619113257 -161,0.5806664509148879 -162,0.8984369263318096 -163,0.8363942698985983 -164,0.2812623945509036 -165,0.10724622968453401 -166,0.5703943012638906 -167,0.7309007201275504 -168,0.6865969394598082 -169,0.17355862259884247 -170,0.41747139600619776 
-171,0.8046329781439144 -172,0.29734663284924356 -173,0.6874907011989809 -174,0.27926268019676004 -175,0.16857167772740067 -176,0.808320103826969 -177,0.22397888146185907 -178,0.4961137292567884 -179,0.39791460648438426 -180,0.749624236829485 -181,0.8166672255804612 -182,0.5416591595071085 -183,0.7784968348980786 -184,0.5246274130247313 -185,0.6165788811775392 -186,0.18993747860389354 -187,0.4375903866391334 -188,0.8977799452863308 -189,0.8974808404906014 -190,0.7833353163003136 -191,0.5735505446147654 -192,0.8592478266591742 -193,0.555628191461239 -194,0.29218190018690193 -195,0.6823254024415241 -196,0.7253556992028032 -197,0.6348979373592366 -198,0.738955355288769 -199,0.40548956817360793 -200,0.9965074549246696 -201,0.6680475408833246 -202,0.4753087000915296 -203,0.8154531729554498 -204,0.39674071637462927 -205,0.3465424212251109 -206,0.3010873336265142 -207,0.3453059844140016 -208,0.3376649450698975 -209,0.4520568726021712 -210,0.7102711170123417 -211,0.5676304992868505 -212,0.246451823758292 -213,0.3045971494873321 -214,0.9191799326806603 -215,0.09062317707388845 -216,0.6456030768852257 -217,0.8145182625891805 -218,0.3502989381872097 -219,0.5454669053640021 -220,0.9229510982790893 -221,0.5017605011244138 -222,0.5814298938642755 -223,0.212077064497179 -224,0.9084673048697015 -225,0.8420689009087419 -226,0.09544595716628035 -227,0.5428219386076877 -228,0.334040059452826 -229,0.5883742904617911 -230,0.6681527250828868 -231,0.920066967991107 -232,0.6980014815164323 -233,0.5140583511099508 -234,0.574062901794968 -235,0.8671650796521554 -236,0.29309281744572635 -237,0.6255644089859125 -238,0.41377688075614283 -239,0.6541722779053092 -240,0.7022455597573617 -241,0.7027961835253476 -242,0.32866027307469425 -243,0.9438823677034145 -244,0.6392304917718383 -245,0.35610068008813955 -246,0.5109988272940061 -247,0.7549785046509206 -248,0.911498498846909 -249,0.7269132750864981 -250,0.43346849143235944 -251,0.9613052659398792 -252,0.06410207161162618 -253,0.7224542800953787 
-254,0.8605028822342475 -255,0.9379303538857604 -256,0.11890111097053702 -257,0.06560232272410749 -258,0.9815175258058294 -259,0.5816233574934034 -260,0.3223771211316614 -261,0.010794999021216611 -262,0.48232848210912416 -263,0.6888091652734284 -264,0.7510123953710294 -265,0.3931342633771988 -266,0.4285185942589612 -267,0.028804295777431044 -268,0.7471054611787746 -269,0.5188475627728396 -270,0.3699806335289325 -271,0.6733240981418717 -272,0.455659972278607 -273,0.8865920570538507 -274,0.9773310825483524 -275,0.9114683092627319 -276,0.7234740743957591 -277,0.47378640650570536 -278,0.9044322182580692 -279,0.6490971485609244 -280,0.9325706015784121 -281,0.15806103989245135 -282,0.20431604755502109 -283,0.9516960107212825 -284,0.17933034496530176 -285,0.10632943259433447 -286,0.20529052976827733 -287,0.26644977396966907 -288,0.990842732357776 -289,0.6626056375310618 -290,0.8934023242009224 -291,0.6087787761836707 -292,0.6622123753279109 -293,0.2795715500728444 -294,0.7356211918792761 -295,0.023450952083761578 -296,0.29930766895885463 -297,0.9605253146799532 -298,0.4773205356946918 -299,0.896685482640458 -300,0.20788119046629716 -301,0.21907107928738412 -302,0.3417751133430835 -303,0.8785812995819484 -304,0.7629857606713326 -305,0.10409839946928867 -306,0.5375122454578438 -307,0.12610808266796247 -308,0.9207106566062669 -309,0.6614470367535862 -310,0.6646296886200127 -311,0.02517423927343887 -312,0.5355435671395777 -313,0.9639505712726043 -314,0.8427700240424094 -315,0.5173256280251634 -316,0.6809361625916177 -317,0.25269387981635383 -318,7.39014254360626e-05 -319,0.6832379417409375 -320,0.3814705574477538 -321,0.2953366513034189 -322,0.8601629667491553 -323,0.4116625534183441 -324,0.20248827761656263 -325,0.0950677170887495 -326,0.37432668808858527 -327,0.5002586204770462 -328,0.5903766299860601 -329,0.4069147751233232 -330,0.46587616114566655 -331,0.20767274566478722 -332,0.4405095567714371 -333,0.7561490702983013 -334,0.9691510044256642 -335,0.9835349892112961 
-336,0.08167974686852508 -337,0.011831197129136273 -338,0.2533369151703784 -339,0.7258386397040382 -340,0.1533224004672512 -341,0.16976063838308353 -342,0.3535761067133554 -343,0.9558080514913609 -344,0.34787269425215606 -345,0.6384858181781367 -346,0.19142808499268715 -347,0.3723886499126876 -348,0.4610104267479409 -349,0.7386414627232165 -350,0.5547224736511918 -351,0.07560627992824742 -352,0.38543929036328295 -353,0.023870001618478964 -354,0.08490118558975879 -355,0.9523181200843006 -356,0.835121255953561 -357,0.8313253101018512 -358,0.4477164423027221 -359,0.427173224834863 -360,0.2607502696316568 -361,0.6518880149684392 -362,0.989596091701078 -363,0.4737188317675711 -364,0.951663574431818 -365,0.6389835611029937 -366,0.4255250760028354 -367,0.36494823219306194 -368,0.10394871793754767 -369,0.08787887115953141 -370,0.05185866702404662 -371,0.5729228447658512 -372,0.3557153056497062 -373,0.14169200930635462 -374,0.6026259214704931 -375,0.6780938325392907 -376,0.0019220493053816456 -377,0.14423401505903843 -378,0.31021740847078616 -379,0.26542859991807166 -380,0.05293698137098246 -381,0.5447383348415423 -382,0.19410883367100906 -383,0.2759766462115508 -384,0.6085305795585376 -385,0.19018564330800136 -386,0.6001023952936514 -387,0.5500869240450543 -388,0.308558554189692 -389,0.613015054522192 -390,0.5053671279653127 -391,0.8033565610860482 -392,0.3190316438196028 -393,0.8430688477494918 -394,0.3907441626865247 -395,0.3749010705929905 -396,0.20374147066354986 -397,0.4445572005828903 -398,0.4325615226381033 -399,0.747347832034453 -400,0.1408237945119577 -401,0.5629196065967164 -402,0.8883715667513505 -403,0.7262344816634011 -404,0.1015240156369166 -405,0.6274596622730756 -406,0.6724938834493908 -407,0.45890555605876826 -408,0.253862163313197 -409,0.20213399227024142 -410,0.9431472444002996 -411,0.4412716272261822 -412,0.6778537756613036 -413,0.5609208700560778 -414,0.7852790417028147 -415,0.8301487622409094 -416,0.0695242591856422 -417,0.5342345164968271 
-418,0.020198821857018268 -419,0.11932836566667071 -420,0.7351542137502673 -421,0.879354084852934 -422,0.060390921051916124 -423,0.3517659280158124 -424,0.25831407832342757 -425,0.25041309629182773 -426,0.6324032934179679 -427,0.6905116746744266 -428,0.038781141504878325 -429,0.11872222658971077 -430,0.3402172182577837 -431,0.1117834948318035 -432,0.8974663997148172 -433,0.7721061886641211 -434,0.467763325594456 -435,0.45960484726135 -436,0.11940893902740168 -437,0.8892320824757846 -438,0.056170722740824464 -439,0.8348974660229447 -440,0.8328276290445746 -441,0.015421942378315512 -442,0.6078039146470725 -443,0.9797170916017848 -444,0.817871594488278 -445,0.4281570072853328 -446,0.9826586617461194 -447,0.5714323337805088 -448,0.5655480118995616 -449,0.13163751508874266 -450,0.5727166298844355 -451,0.3876989055629705 -452,0.24625748760449773 -453,0.062376725489559304 -454,0.1868295868142189 -455,0.07519337399332371 -456,0.8615125038568271 -457,0.0430765434686432 -458,0.7784279481001283 -459,0.1559200654309939 -460,0.28457480300272475 -461,0.4833371043049315 -462,0.21688560355701902 -463,0.051055375260327884 -464,0.8764119752087609 -465,0.03830180552041673 -466,0.899276170682331 -467,0.5326669068942715 -468,0.7966592760107886 -469,0.5977938689767619 -470,0.35735055753216216 -471,0.7502306585594846 -472,0.27262195939610845 -473,0.3367003915054816 -474,0.3718378858875636 -475,0.7252726856566986 -476,0.6108078470654391 -477,0.160140124957443 -478,0.640641195165919 -479,0.819043970313203 -480,0.9460930077740923 -481,0.3955113176387407 -482,0.08228064172201954 -483,0.5692148152461914 -484,0.9379027430417781 -485,0.7262721958954546 -486,0.9974714724600596 -487,0.9816411645054782 -488,0.02801478549452141 -489,0.35876394018958924 -490,0.46224300725504386 -491,0.07977812492324099 -492,0.7825821331768681 -493,0.7728747320072956 -494,0.18411522733742114 -495,0.9349933626453013 -496,0.3305156463539396 -497,0.05247324921620988 -498,0.3784435570491954 -499,0.8296025413407634 
-500,0.44108727645927825 -501,0.2993358032378495 -502,0.8631126359025391 -503,0.250262827945147 -504,0.09566738091105942 -505,0.7130474946994906 -506,0.2235781443128807 -507,0.7026149405611689 -508,0.7224945548679957 -509,0.6170012611217315 -510,0.20186432914831431 -511,0.7852714452298651 -512,0.8903242744728199 -513,0.1399056906045737 -514,0.17026945833848617 -515,0.514586763470415 -516,0.9736100357614889 -517,0.7746591507784915 -518,0.29437001890274195 -519,0.8027253084378705 -520,0.08386991518130038 -521,0.09136100092018629 -522,0.8983567502463687 -523,0.8868693311046169 -524,0.533466309836137 -525,0.42900189716927073 -526,0.1821870276409372 -527,0.4315150943786541 -528,0.47383956070476785 -529,0.42647315825719867 -530,0.20889106515275513 -531,0.15615589390655582 -532,0.7683598815481214 -533,0.8407774935346721 -534,0.4599058924434972 -535,0.20858605861422153 -536,0.25419023941340724 -537,0.03537597137641857 -538,0.5037011171417803 -539,0.319855948227728 -540,0.6143932185624659 -541,0.11338109816795006 -542,0.6071773224023549 -543,0.6320103598568474 -544,0.17739418618305125 -545,0.9193076779462215 -546,0.539317629461803 -547,0.361121293498606 -548,0.8225521587592494 -549,0.037067189096233966 -550,0.7644376889628157 -551,0.9614375433647248 -552,0.26247829558958613 -553,0.04497704041286332 -554,0.49347237237561237 -555,0.10135820428850206 -556,0.9054759324635467 -557,0.3912479745377101 -558,0.16984308812935767 -559,0.3130327921420567 -560,0.2845393861009978 -561,0.7216547111114262 -562,0.6129838442158642 -563,0.6128072542663652 -564,0.5153838338789999 -565,0.7131085367862817 -566,0.8713477772442941 -567,0.9419360672901563 -568,0.9061770339937525 -569,0.9973713503589123 -570,0.6511737928834931 -571,0.0980714039543844 -572,0.12371358453480508 -573,0.5817580949438432 -574,0.3878197750090975 -575,0.3836838844640248 -576,0.3330772932400339 -577,0.8937920239990277 -578,0.42660379831271933 -579,0.09749777821209016 -580,0.03273234283716975 -581,0.5822939987582022 
-582,0.2818759219290342 -583,0.9973773382690185 -584,0.3485811650096795 -585,0.38385951065171464 -586,0.14314846321555819 -587,0.41168484188278187 -588,0.5560325831949468 -589,0.6786651527115524 -590,0.27941662328630534 -591,0.12758615070559087 -592,0.8706880276786881 -593,0.42247163006009736 -594,0.8747921784321767 -595,0.9819789489386005 -596,0.53212913612486 -597,0.6820548577830702 -598,0.14172556124342628 -599,0.8954903213991394 -600,0.8877895505948118 -601,0.2899734461911796 -602,0.39888758518426926 -603,0.5085270928974726 -604,0.5397323464650328 -605,0.5355595876880633 -606,0.6680045600991499 -607,0.07890855054344348 -608,0.36522753036507116 -609,0.7525828516063231 -610,0.8155334605307646 -611,0.948872329161571 -612,0.10085424156574552 -613,0.3063104444859259 -614,0.012248867459916157 -615,0.8332405266792986 -616,0.4477328006875678 -617,0.7381760858313725 -618,0.5381307278002123 -619,0.64442652761133 -620,0.407653279216153 -621,0.988120343671508 -622,0.349242158981631 -623,0.11439639275168989 -624,0.773600974105568 -625,0.3422508667504136 -626,0.35092901992304426 -627,0.6998555631853256 -628,0.5351463864628954 -629,0.6941915466139217 -630,0.27550090759498 -631,0.03955870654832727 -632,0.9737612333749457 -633,0.85659566451438 -634,0.318016024519294 -635,0.07264967870375483 -636,0.6266672136646679 -637,0.5427530067840908 -638,0.08013357115177333 -639,0.27865447324993387 -640,0.8204327600278204 -641,0.6472338718548233 -642,0.8981066937808309 -643,0.9904134149156683 -644,0.7570648348954108 -645,0.04820939759809295 -646,0.49659488586991385 -647,0.2681871451946377 -648,0.05376519761698151 -649,0.1536101940376925 -650,0.2458849441738461 -651,0.19991898782481343 -652,0.49815295225863154 -653,0.7475145062482099 -654,0.5814474904248211 -655,0.9103815228294841 -656,0.8091439841662771 -657,0.044556478634595 -658,0.06582839484468272 -659,0.8723124347377673 -660,0.761407419742959 -661,0.6295611439582762 -662,0.5602756647971817 -663,0.028833108636930782 
-664,0.6925154173449602 -665,0.30781547100300766 -666,0.9456746547718861 -667,0.7733519530494579 -668,0.07325928323474962 -669,0.06051359621130603 -670,0.7684091239449635 -671,0.0772898478864189 -672,0.4652145959688888 -673,0.4373876627767307 -674,0.6267684478070814 -675,0.7183418633741062 -676,0.28256468766217413 -677,0.5073826011665699 -678,0.31820311938601464 -679,0.4089168748142118 -680,0.29885921770184043 -681,0.03372851278925548 -682,0.6703170306185748 -683,0.33198869826189814 -684,0.5975405123566822 -685,0.8211657963714585 -686,0.3461079054656666 -687,0.48616250243415104 -688,0.13447950866733605 -689,0.562667191415577 -690,0.7678216928305848 -691,0.4530052286033409 -692,0.5010228200975811 -693,0.4323309760765164 -694,0.36743023729184987 -695,0.1723991626473217 -696,0.4337302869241262 -697,0.24966845326719822 -698,0.642167289966723 -699,0.616830008851879 -700,0.7703637450499222 -701,0.21386173939654995 -702,0.704115745850898 -703,0.6905967742396926 -704,0.14550064889741277 -705,0.6045853103312959 -706,0.03670533871021342 -707,0.7158949195594291 -708,0.5963326610400751 -709,0.7656919572130952 -710,0.16593604258736716 -711,0.37116447793513807 -712,0.8005826062394383 -713,0.041771054650389106 -714,0.6847846478124059 -715,0.4993883882765534 -716,0.1850707225574446 -717,0.5630874044249621 -718,0.37025234599378876 -719,0.7107125656980158 -720,0.4118677519270143 -721,0.7742568360649871 -722,0.8100159822588088 -723,0.3174629757017041 -724,0.5303493054894146 -725,0.8849961235045513 -726,0.3273403729546115 -727,0.6172150375830504 -728,0.15983060531231819 -729,0.4728594510763161 -730,0.4529506215548965 -731,0.5035430872599636 -732,0.004927231548344402 -733,0.1940383807540148 -734,0.14982458424309364 -735,0.8563549025851751 -736,0.03884058951015723 -737,0.28522238435867453 -738,0.8057900651211597 -739,0.03021709036511122 -740,0.07224489509195386 -741,0.056610587902518716 -742,0.9264467821014194 -743,0.8138662549320123 -744,0.41783822642927937 -745,0.8723047253359363 
-746,0.18136207963463802 -747,0.7164025688996778 -748,0.8196872616954788 -749,0.8068822585021751 -750,0.007129291396152926 -751,0.2602504030386925 -752,0.46370562857123043 -753,0.163784347412389 -754,0.23315134483036648 -755,0.6177440123966893 -756,0.2561521510607473 -757,0.562548076892661 -758,0.5051861935336659 -759,0.13892890236963107 -760,0.004539613445676105 -761,0.17372524036846493 -762,0.6832015932759417 -763,0.8325857535808265 -764,6.826981312790803e-05 -765,0.19612584863473537 -766,0.4145509719106246 -767,0.2619625834737831 -768,0.24549665294458467 -769,0.27612714237335956 -770,0.8531795517703349 -771,0.047146001044882424 -772,0.562788499298586 -773,0.43099863376962144 -774,0.26050958743406505 -775,0.7788002061420074 -776,0.6743332176478016 -777,0.40066992822420555 -778,0.9760876856806906 -779,0.539119034171984 -780,0.18208901259127885 -781,0.12376735142175199 -782,0.9551514655114575 -783,0.7810294736400567 -784,0.9212583468427701 -785,0.8010043139785669 -786,0.22944051406680832 -787,0.050052241727377766 -788,0.6786745563768194 -789,0.429793629888368 -790,0.42563361699182967 -791,0.6784838537337905 -792,0.2858761720399675 -793,0.2890895011305119 -794,0.025121632825633844 -795,0.25765509253553054 -796,0.43572322499776717 -797,0.6647102169428171 -798,0.10847616026636064 -799,0.2537450603718995 -800,0.24416864473064126 -801,0.0672514263787497 -802,0.16935229953659314 -803,0.27439580112524253 -804,0.4284736191801598 -805,0.8586734606964571 -806,0.4315781202007021 -807,0.09915635234890208 -808,0.44899905032025744 -809,0.013316716483281699 -810,0.8391449274551819 -811,0.5061770521104294 -812,0.0672045714638001 -813,0.2933544809181752 -814,0.18022127393582965 -815,0.8781136361676581 -816,0.5157135259800142 -817,0.46243072336418334 -818,0.6222491687600095 -819,0.8889053056935484 -820,0.04571095891205823 -821,0.1513640763692672 -822,0.7774449453314359 -823,0.5183880690457242 -824,0.2921720252636122 -825,0.09168278609192515 -826,0.39002371887786735 
-827,0.3580585061283823 -828,0.12047021435718164 -829,0.6738337221623005 -830,0.21958552211366156 -831,0.5648142473736366 -832,0.23497653874753555 -833,0.16544595712611387 -834,0.040561694693181605 -835,0.7355715205459343 -836,0.9004365787736869 -837,0.5459151013055901 -838,0.7480058346265005 -839,0.7141260383574005 -840,0.1158157631511092 -841,0.9125379342891712 -842,0.3680018768100638 -843,0.7402206231811581 -844,0.2972738079840226 -845,0.8923504613507662 -846,0.5063568640229354 -847,0.24619949696371157 -848,0.5399981903000146 -849,0.7188539530946122 -850,0.648195890336554 -851,0.724518894463568 -852,0.14288147919479144 -853,0.7994514226699949 -854,0.6226355760247099 -855,0.010176035425188967 -856,0.4131692686695717 -857,0.834692399566853 -858,0.49912957372925004 -859,0.00438814293685974 -860,0.3252041908817417 -861,0.534840233118543 -862,0.3587118743837924 -863,0.9677560902733098 -864,0.5973183201684436 -865,0.296691425381007 -866,0.5855079326424412 -867,0.20240300955532187 -868,0.6021550529096645 -869,0.8824421051967469 -870,0.3072946199859422 -871,0.3128979438155097 -872,0.5475105438225643 -873,0.4842448962628426 -874,0.15025538438496855 -875,0.310622456701922 -876,0.6023436011138587 -877,0.5754165898365287 -878,0.6577607923072721 -879,0.7857515237431592 -880,0.22057576301022253 -881,0.8661095076438114 -882,0.910244039608377 -883,0.578456971142587 -884,0.3787935162597653 -885,0.08939098828841929 -886,0.9232626564888574 -887,0.1712490756353049 -888,0.779216672902944 -889,0.3495372334946847 -890,0.47001887737996617 -891,0.29750226759355936 -892,0.2810128485470573 -893,0.2437794575755069 -894,0.2624381305719474 -895,0.8246608579175856 -896,0.6942956761673141 -897,0.11515579868519688 -898,0.1206162339748359 -899,0.26196220525263014 -900,0.5553026135773536 -901,0.40720637901420265 -902,0.9638145298530792 -903,0.4117628415691498 -904,0.31618951259604455 -905,0.11765701103218917 -906,0.33470652854411564 -907,0.7366235956449027 -908,0.7581529716898141 
-909,0.9554767313213507 -910,0.8837680591214232 -911,0.12426303151941864 -912,0.13192594906673982 -913,0.13159583337236658 -914,0.8413301780622977 -915,0.5495370639785346 -916,0.8125566245605387 -917,0.764454058143039 -918,0.9022709587116715 -919,0.22879685531861071 -920,0.49057430203325403 -921,0.4724960647844604 -922,0.8055598260756343 -923,0.7603094118394911 -924,0.3728373302689516 -925,0.3568389711535207 -926,0.4241494594670866 -927,0.7538918294606227 -928,0.5278021541536974 -929,0.4605573424438759 -930,0.6738635250250887 -931,0.16054005910324365 -932,0.8428762894592794 -933,0.9518468101445031 -934,0.32776599980321264 -935,0.3459454626103713 -936,0.08290510118997685 -937,0.4134429089919419 -938,0.7577633137424186 -939,0.4360752405153524 -940,0.977898855124461 -941,0.3899549115493246 -942,0.07360874043480192 -943,0.6234394805204561 -944,0.8281399000229284 -945,0.5936401403938281 -946,0.9444301233719021 -947,0.18311569423561358 -948,0.19900897833219744 -949,0.5859537329420677 -950,0.45369641243149117 -951,0.8140494291811821 -952,0.15504116789135103 -953,0.5097058344234562 -954,0.46015129255339193 -955,0.9168374769143446 -956,0.6646855362668478 -957,0.08710995188842596 -958,0.9648211892689712 -959,0.3099412950871465 -960,0.4182764603873177 -961,0.2811470272374724 -962,0.36150098707209977 -963,0.7547921114548144 -964,0.038441021458981206 -965,0.6114605284345398 -966,0.20333754648264146 -967,0.6879693726518868 -968,0.5615887399000671 -969,0.10931708773465398 -970,0.8275712918793767 -971,0.7747109160797243 -972,0.9005913428689535 -973,0.6399242580079716 -974,0.717434307883715 -975,0.0782758727785875 -976,0.05968847507483932 -977,0.9824576958211914 -978,0.02495988725135534 -979,0.2620968894854523 -980,0.010107863826380292 -981,0.2764875736254404 -982,0.18403412415931986 -983,0.1616789092290818 -984,0.3454521050417132 -985,0.433499552863608 -986,0.040911884966301715 -987,0.20484238883308725 -988,0.6675520566953549 -989,0.6160709258598361 -990,0.04474552091720452 
-991,0.40241951588041347 -992,0.5873473825076658 -993,0.38212818142632543 -994,0.8770948644179681 -995,0.18210726703943658 -996,0.7879879363150989 -997,0.14870738186047538 -998,0.15312132054135852 -999,0.4747372545447177 From 585e412916ccca9f6a64cd3e840c1e5106b5bb2d Mon Sep 17 00:00:00 2001 From: Shesh Narayan Gupta <91396937+SheshNGupta@users.noreply.github.com> Date: Mon, 6 Jun 2022 18:49:39 -0400 Subject: [PATCH 4/8] Create readme.md --- notebooks/Imputation_best_practices/readme.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 notebooks/Imputation_best_practices/readme.md diff --git a/notebooks/Imputation_best_practices/readme.md b/notebooks/Imputation_best_practices/readme.md new file mode 100644 index 0000000..b4d57c1 --- /dev/null +++ b/notebooks/Imputation_best_practices/readme.md @@ -0,0 +1 @@ +This folder will contain the notebook and the data used for demonstrating how to effectively use imputation practices using KNN and mean imputations From 7e578f5b9a9171628746e09c987e158e01737f9f Mon Sep 17 00:00:00 2001 From: Shesh Narayan Gupta <91396937+SheshNGupta@users.noreply.github.com> Date: Mon, 6 Jun 2022 18:50:26 -0400 Subject: [PATCH 5/8] imputation best practices --- .../Imputation_best_practices.ipynb | 4557 +++++++++++++++++ .../random_numbers_1000.csv | 1001 ++++ 2 files changed, 5558 insertions(+) create mode 100644 notebooks/Imputation_best_practices/Imputation_best_practices.ipynb create mode 100644 notebooks/Imputation_best_practices/random_numbers_1000.csv diff --git a/notebooks/Imputation_best_practices/Imputation_best_practices.ipynb b/notebooks/Imputation_best_practices/Imputation_best_practices.ipynb new file mode 100644 index 0000000..87d582d --- /dev/null +++ b/notebooks/Imputation_best_practices/Imputation_best_practices.ipynb @@ -0,0 +1,4557 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "e2ceaeb0-e282-4c63-97e2-f1dd03810aa2", + "metadata": {}, + "source": [ + "# What to try in this notebook?\n", + "\n", + "#### 1. 
Get a random number generated dataset from kaggle, use one column and create missing (1%, 5%, 10%), scale values, apply KNN, MEAN imputation. Compare the results and compute mean() and var() for the list of differences between org. and Imputed value \n", + "\n", + "Dataset - https://www.kaggle.com/timoboz/random-numbers\n", + "\n", + "#### 2. Use a housing dataset from UCI, use one column and create missing (1%, 5%, 10%), scale values, apply KNN, MEAN imputation. Compare the results and compute mean() and var() for the list of differences between org. and Imputed value \n", + "\n", + "Dataset - https://github.com/nikbearbrown/AI_Research_Group/blob/main/Awesome-UCI-Datasets/Classification/House_Price_predication/train.csv" + ] + }, + { + "cell_type": "code", + "execution_count": 101, + "id": "d8fe4103-6e71-4b97-810c-b599a0482944", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "from sklearn.impute import KNNImputer\n", + "from sklearn.preprocessing import MinMaxScaler" + ] + }, + { + "cell_type": "markdown", + "id": "f95427ef-d6bc-47b8-a516-45a05b238180", + "metadata": {}, + "source": [ + "# 1.1 Random Numbers dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 102, + "id": "03fc0415-cdd2-415b-a273-08037b06afcf", + "metadata": {}, + "outputs": [], + "source": [ + "random_dataset = pd.read_csv('random_numbers_1000.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": 103, + "id": "5ea97930-03cd-48ff-97b9-97e9cd9dde55", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + "<thead>\n", + "<tr><th></th><th>Unnamed: 0</th><th>number</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "
<tr><th>782</th><td>782</td><td>0.955151</td></tr>\n", + "
<tr><th>378</th><td>378</td><td>0.310217</td></tr>\n", + "
<tr><th>542</th><td>542</td><td>0.607177</td></tr>\n", + "
<tr><th>80</th><td>80</td><td>0.861696</td></tr>\n", + "
<tr><th>282</th><td>282</td><td>0.204316</td></tr>\n", + "
<tr><th>976</th><td>976</td><td>0.059688</td></tr>\n", + "
<tr><th>924</th><td>924</td><td>0.372837</td></tr>\n", + "
<tr><th>329</th><td>329</td><td>0.406915</td></tr>\n", + "
<tr><th>131</th><td>131</td><td>0.402420</td></tr>\n", + "
<tr><th>607</th><td>607</td><td>0.078909</td></tr>\n", + "
</tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " Unnamed: 0 number\n", + "782 782 0.955151\n", + "378 378 0.310217\n", + "542 542 0.607177\n", + "80 80 0.861696\n", + "282 282 0.204316\n", + "976 976 0.059688\n", + "924 924 0.372837\n", + "329 329 0.406915\n", + "131 131 0.402420\n", + "607 607 0.078909" + ] + }, + "execution_count": 103, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "random_dataset.sample(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 104, + "id": "f19e199b-91aa-4e03-9e07-37f5a574d481", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 1000 entries, 0 to 999\n", + "Data columns (total 2 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 Unnamed: 0 1000 non-null int64 \n", + " 1 number 1000 non-null float64\n", + "dtypes: float64(1), int64(1)\n", + "memory usage: 15.8 KB\n" + ] + } + ], + "source": [ + "random_dataset.info()" + ] + }, + { + "cell_type": "code", + "execution_count": 105, + "id": "382f0f03-b3f4-4244-a95c-e78476fae2ca", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "count 1000.000000\n", + "mean 0.490463\n", + "std 0.284669\n", + "min 0.000068\n", + "25% 0.252124\n", + "50% 0.479825\n", + "75% 0.735584\n", + "max 0.997610\n", + "Name: number, dtype: float64" + ] + }, + "execution_count": 105, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "random_dataset['number'].describe()" + ] + }, + { + "cell_type": "markdown", + "id": "348a0b85-c450-4d5d-a9d2-c57c95964b42", + "metadata": {}, + "source": [ + "#### Create 3 col. for numbers for 1%, 5% and 10% missing data" + ] + }, + { + "cell_type": "code", + "execution_count": 106, + "id": "f5de26b3-17b7-463b-98e4-147a457ca37e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + "<thead>\n", + "<tr><th></th><th>number</th><th>number_copy_1_percent</th><th>number_copy_5_percent</th><th>number_copy_10_percent</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "
<tr><th>0</th><td>0.144616</td><td>0.144616</td><td>0.144616</td><td>0.144616</td></tr>\n", + "
<tr><th>1</th><td>0.077515</td><td>0.077515</td><td>0.077515</td><td>0.077515</td></tr>\n", + "
<tr><th>2</th><td>0.155933</td><td>0.155933</td><td>0.155933</td><td>0.155933</td></tr>\n", + "
<tr><th>3</th><td>0.097209</td><td>0.097209</td><td>0.097209</td><td>0.097209</td></tr>\n", + "
<tr><th>4</th><td>0.323750</td><td>0.323750</td><td>0.323750</td><td>0.323750</td></tr>\n", + "
<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n", + "
<tr><th>995</th><td>0.182107</td><td>0.182107</td><td>0.182107</td><td>0.182107</td></tr>\n", + "
<tr><th>996</th><td>0.787988</td><td>0.787988</td><td>0.787988</td><td>0.787988</td></tr>\n", + "
<tr><th>997</th><td>0.148707</td><td>0.148707</td><td>0.148707</td><td>0.148707</td></tr>\n", + "
<tr><th>998</th><td>0.153121</td><td>0.153121</td><td>0.153121</td><td>0.153121</td></tr>\n", + "
<tr><th>999</th><td>0.474737</td><td>0.474737</td><td>0.474737</td><td>0.474737</td></tr>\n", + "
</tbody>\n", + "</table>\n", + "
<p>1000 rows × 4 columns</p>\n", + "
</div>
" + ], + "text/plain": [ + " number number_copy_1_percent number_copy_5_percent \\\n", + "0 0.144616 0.144616 0.144616 \n", + "1 0.077515 0.077515 0.077515 \n", + "2 0.155933 0.155933 0.155933 \n", + "3 0.097209 0.097209 0.097209 \n", + "4 0.323750 0.323750 0.323750 \n", + ".. ... ... ... \n", + "995 0.182107 0.182107 0.182107 \n", + "996 0.787988 0.787988 0.787988 \n", + "997 0.148707 0.148707 0.148707 \n", + "998 0.153121 0.153121 0.153121 \n", + "999 0.474737 0.474737 0.474737 \n", + "\n", + " number_copy_10_percent \n", + "0 0.144616 \n", + "1 0.077515 \n", + "2 0.155933 \n", + "3 0.097209 \n", + "4 0.323750 \n", + ".. ... \n", + "995 0.182107 \n", + "996 0.787988 \n", + "997 0.148707 \n", + "998 0.153121 \n", + "999 0.474737 \n", + "\n", + "[1000 rows x 4 columns]" + ] + }, + "execution_count": 106, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_number = random_dataset[['number']]\n", + "df_number['number_copy_1_percent'] = df_number[['number']]\n", + "df_number['number_copy_5_percent'] = df_number[['number']]\n", + "df_number['number_copy_10_percent'] = df_number[['number']]\n", + "df_number" + ] + }, + { + "cell_type": "markdown", + "id": "1ff95002-46a0-454b-97c1-6c189153d459", + "metadata": {}, + "source": [ + "#### Check % missing values in this dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 107, + "id": "35c38775-26d9-4b1e-97a9-4c46c0d5d92b", + "metadata": {}, + "outputs": [], + "source": [ + "def get_percent_missing(dataframe):\n", + " \n", + " percent_missing = dataframe.isnull().sum() * 100 / len(dataframe)\n", + " missing_value_df = pd.DataFrame({'column_name': dataframe.columns,\n", + " 'percent_missing': percent_missing})\n", + " return missing_value_df" + ] + }, + { + "cell_type": "code", + "execution_count": 108, + "id": "6837b7e5-4444-4914-9c0e-a9cefd2c7b6f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + 
"number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 0.0\n", + "number_copy_5_percent number_copy_5_percent 0.0\n", + "number_copy_10_percent number_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_number))" + ] + }, + { + "cell_type": "markdown", + "id": "25318ebf-b1bf-4f4b-ba1d-011b27a27f39", + "metadata": {}, + "source": [ + "#### Create missing helper fn" + ] + }, + { + "cell_type": "code", + "execution_count": 109, + "id": "76da9076-d9c8-417e-bcfc-8ce7066d1a53", + "metadata": {}, + "outputs": [], + "source": [ + "def create_missing(dataframe, percent, col):\n", + " dataframe.loc[dataframe.sample(frac = percent).index, col] = np.nan" + ] + }, + { + "cell_type": "markdown", + "id": "9dc43e57-be39-4efe-8131-d6a3423b8d77", + "metadata": {}, + "source": [ + "#### Create missing data in each col" + ] + }, + { + "cell_type": "code", + "execution_count": 110, + "id": "6e8ab693-6043-4ade-b62a-9b3fc9ebf735", + "metadata": {}, + "outputs": [], + "source": [ + "create_missing(df_number, 0.01, 'number_copy_1_percent')\n", + "create_missing(df_number, 0.05, 'number_copy_5_percent')\n", + "create_missing(df_number, 0.1, 'number_copy_10_percent')" + ] + }, + { + "cell_type": "markdown", + "id": "655cb92a-6b63-4498-9c31-d63f11145569", + "metadata": {}, + "source": [ + "#### Check % missing after removing data" + ] + }, + { + "cell_type": "code", + "execution_count": 111, + "id": "412518b5-67ec-4a5a-9720-4a0ce7657d44", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 1.0\n", + "number_copy_5_percent number_copy_5_percent 5.0\n", + "number_copy_10_percent number_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_number))" + ] + }, + { + "cell_type": "markdown", + "id": "6876e3fc-b878-4560-a3a4-72c36f2a422e", + "metadata": {}, + "source": [ + 
"#### Store the indices of missing rows" + ] + }, + { + "cell_type": "code", + "execution_count": 112, + "id": "c1860270-add6-4963-9aef-27ef1e171fca", + "metadata": {}, + "outputs": [], + "source": [ + "# Store Index of NaN values in each coloumns\n", + "number_1_idx = list(np.where(df_number['number_copy_1_percent'].isna())[0])\n", + "number_5_idx = list(np.where(df_number['number_copy_5_percent'].isna())[0])\n", + "number_10_idx = list(np.where(df_number['number_copy_10_percent'].isna())[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 113, + "id": "57841da6-b453-40cc-8ecc-702fe4613a74", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Length of number_1_idx is 10 and it contains 1.0% of total data in column | Total rows: 1000\n", + "Length of number_5_idx is 50 and it contains 5.0% of total data in column | Total rows: 1000\n", + "Length of number_10_idx is 100 and it contains 10.0% of total data in column | Total rows: 1000\n" + ] + } + ], + "source": [ + "print(f\"Length of number_1_idx is {len(number_1_idx)} and it contains {(len(number_1_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")\n", + "print(f\"Length of number_5_idx is {len(number_5_idx)} and it contains {(len(number_5_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")\n", + "print(f\"Length of number_10_idx is {len(number_10_idx)} and it contains {(len(number_10_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")" + ] + }, + { + "cell_type": "markdown", + "id": "47469d0b-a8f3-4469-b18c-3a457f7dc373", + "metadata": {}, + "source": [ + "### Perform KNN impute to df_number dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 114, + "id": "b09c6c85-4ce3-4aeb-bb81-6a698494a58e", 
+ "metadata": {}, + "outputs": [], + "source": [ + "df_number1 = df_number.copy(deep=True)\n", + "imputer = KNNImputer(n_neighbors=5)\n", + "imputed_number_df = pd.DataFrame(imputer.fit_transform(df_number1), columns = df_number1.columns)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 115, + "id": "2f051a7d-3ebd-4839-aae0-ef125944d613", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + "<thead>\n", + "<tr><th></th><th>number</th><th>number_copy_1_percent</th><th>number_copy_5_percent</th><th>number_copy_10_percent</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "
<tr><th>347</th><td>0.372389</td><td>0.372389</td><td>0.372389</td><td>0.372389</td></tr>\n", + "
<tr><th>934</th><td>0.327766</td><td>0.327766</td><td>0.327766</td><td>0.327766</td></tr>\n", + "
<tr><th>927</th><td>0.753892</td><td>0.753892</td><td>0.753892</td><td>0.753892</td></tr>\n", + "
<tr><th>997</th><td>0.148707</td><td>0.148707</td><td>0.148707</td><td>0.148707</td></tr>\n", + "
<tr><th>167</th><td>0.730901</td><td>0.730901</td><td>0.730901</td><td>0.730901</td></tr>\n", + "
<tr><th>914</th><td>0.841330</td><td>0.841330</td><td>0.841330</td><td>0.841330</td></tr>\n", + "
<tr><th>432</th><td>0.897466</td><td>0.897466</td><td>0.897466</td><td>0.897466</td></tr>\n", + "
<tr><th>587</th><td>0.411685</td><td>0.411685</td><td>0.411685</td><td>0.411685</td></tr>\n", + "
<tr><th>884</th><td>0.378794</td><td>0.378794</td><td>0.378794</td><td>0.378794</td></tr>\n", + "
<tr><th>379</th><td>0.265429</td><td>0.265429</td><td>0.265429</td><td>0.264843</td></tr>\n", + "
</tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " number number_copy_1_percent number_copy_5_percent \\\n", + "347 0.372389 0.372389 0.372389 \n", + "934 0.327766 0.327766 0.327766 \n", + "927 0.753892 0.753892 0.753892 \n", + "997 0.148707 0.148707 0.148707 \n", + "167 0.730901 0.730901 0.730901 \n", + "914 0.841330 0.841330 0.841330 \n", + "432 0.897466 0.897466 0.897466 \n", + "587 0.411685 0.411685 0.411685 \n", + "884 0.378794 0.378794 0.378794 \n", + "379 0.265429 0.265429 0.265429 \n", + "\n", + " number_copy_10_percent \n", + "347 0.372389 \n", + "934 0.327766 \n", + "927 0.753892 \n", + "997 0.148707 \n", + "167 0.730901 \n", + "914 0.841330 \n", + "432 0.897466 \n", + "587 0.411685 \n", + "884 0.378794 \n", + "379 0.264843 " + ] + }, + "execution_count": 115, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "imputed_number_df.sample(10)" + ] + }, + { + "cell_type": "markdown", + "id": "ddc79a45-bd2b-44f3-a3c4-aaefa73b43d9", + "metadata": {}, + "source": [ + "#### Check the % missing data in dataframe now" + ] + }, + { + "cell_type": "code", + "execution_count": 116, + "id": "5c98d450-bf5a-46e5-9091-c6a1202a2611", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 0.0\n", + "number_copy_5_percent number_copy_5_percent 0.0\n", + "number_copy_10_percent number_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(imputed_number_df))" + ] + }, + { + "cell_type": "markdown", + "id": "f14476bf-29e6-4d9a-9cd4-9dd56a53b466", + "metadata": {}, + "source": [ + "#### Store the list of differences between org. 
and Imputed value" + ] + }, + { + "cell_type": "code", + "execution_count": 117, + "id": "3f096800-dc6e-4455-a9e6-2db18884e5ee", + "metadata": {}, + "outputs": [], + "source": [ + "# build the lists of absolute differences between imputed and original values\n", + "\n", + "number_diff_1 = []\n", + "number_diff_5 = []\n", + "number_diff_10 = []\n", + "count = 0\n", + "\n", + "for i in number_1_idx:\n", + " count +=1\n", + " diff1 = abs(imputed_number_df['number_copy_1_percent'][i] - df_number1['number'][i])\n", + " number_diff_1.append(diff1)\n", + " \n", + "\n", + "for i in number_5_idx:\n", + " diff5 = abs(imputed_number_df['number_copy_5_percent'][i] - df_number1['number'][i])\n", + " number_diff_5.append(diff5)\n", + "\n", + "for i in number_10_idx:\n", + " diff10 = abs(imputed_number_df['number_copy_10_percent'][i] - df_number1['number'][i])\n", + " number_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 118, + "id": "4a2c29fc-99f3-4624-808e-437d3983cabb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(number_diff_1))\n", + "print(len(number_diff_5))\n", + "print(len(number_diff_10))" + ] + }, + { + "cell_type": "markdown", + "id": "4ec4adbe-5571-40e3-90ba-92cb431161ca", + "metadata": {}, + "source": [ + "### Calculate the mean and variance of the list of differences - KNN" + ] + }, + { + "cell_type": "code", + "execution_count": 119, + "id": "1163cb62-9dc4-427e-b5cf-20bf3e16d79b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 0.0007902710470742466 and varience 1% is 4.5687016451605466e-07\n", + "The mean of 5% is 0.000675654857997236 and varience 5% is 3.072444468179742e-07\n", + "The mean of 10% is 0.000675654857997236 and varience 10% is 2.480608628449602e-07\n" + ] + } + ], + "source": [ + "m1 = sum(number_diff_1) / len(number_diff_1)\n", + "\n", + "# calculate the population 
variance using a generator expression\n", + "var_res1 = sum((xi - m1) ** 2 for xi in number_diff_1) / len(number_diff_1)\n", + "\n", + "m5 = sum(number_diff_5) / len(number_diff_5)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res5 = sum((xii - m5) ** 2 for xii in number_diff_5) / len(number_diff_5)\n", + "\n", + "\n", + "m10 = sum(number_diff_10) / len(number_diff_10)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in number_diff_10) / len(number_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 120, + "id": "6987d059-7449-44a0-a3c2-8605362a18a0", + "metadata": {}, + "outputs": [], + "source": [ + "df_knn_number = pd.DataFrame.from_dict({'1%_number': [m1, var_res1],\n", + " '5%_number': [m5, var_res5],\n", + " '10%_number': [m10, var_res10]}, orient='index')\n", + "df_knn_number.columns=['diff. list Mean(KNN)', 'diff. 
list Var.(KNN)']" + ] + }, + { + "cell_type": "markdown", + "id": "41740e20-5dae-403e-a83b-94c91469fcc3", + "metadata": {}, + "source": [ + "### Perform MEAN based imputation" + ] + }, + { + "cell_type": "markdown", + "id": "17b69478-e97c-41b9-828a-eefbb46eb161", + "metadata": {}, + "source": [ + "#### Before mean imputation % missing" + ] + }, + { + "cell_type": "code", + "execution_count": 121, + "id": "5a828216-8f1a-4157-8141-77e6c929f57a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 1.0\n", + "number_copy_5_percent number_copy_5_percent 5.0\n", + "number_copy_10_percent number_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "df_number2 = df_number.copy(deep=True)\n", + "print(get_percent_missing(df_number2))" + ] + }, + { + "cell_type": "code", + "execution_count": 122, + "id": "1e137676-9f01-44b9-8a84-50d03a89436b", + "metadata": {}, + "outputs": [], + "source": [ + "df_number2['number_copy_1_percent'] = df_number2['number_copy_1_percent'].fillna(df_number2['number_copy_1_percent'].mean())\n", + "df_number2['number_copy_5_percent'] = df_number2['number_copy_5_percent'].fillna(df_number2['number_copy_5_percent'].mean())\n", + "df_number2['number_copy_10_percent'] = df_number2['number_copy_10_percent'].fillna(df_number2['number_copy_10_percent'].mean())" + ] + }, + { + "cell_type": "markdown", + "id": "8da82021-d96a-46ac-81df-035977cb5497", + "metadata": {}, + "source": [ + "#### After mean impute % missing " + ] + }, + { + "cell_type": "code", + "execution_count": 123, + "id": "669c14bd-f920-47db-8476-1cd1b4f4f5bb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 0.0\n", + "number_copy_5_percent number_copy_5_percent 0.0\n", + 
"number_copy_10_percent number_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_number2))" + ] + }, + { + "cell_type": "code", + "execution_count": 124, + "id": "ccb60d18-b24e-4211-9947-46ee0bcc06fe", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + "<thead>\n", + "<tr><th></th><th>number</th><th>number_copy_1_percent</th><th>number_copy_5_percent</th><th>number_copy_10_percent</th></tr>\n", + "</thead>\n", + "<tbody>\n", + "
<tr><th>366</th><td>0.425525</td><td>0.425525</td><td>0.425525</td><td>0.425525</td></tr>\n", + "
<tr><th>145</th><td>0.246589</td><td>0.246589</td><td>0.246589</td><td>0.246589</td></tr>\n", + "
<tr><th>538</th><td>0.503701</td><td>0.503701</td><td>0.503701</td><td>0.503701</td></tr>\n", + "
<tr><th>256</th><td>0.118901</td><td>0.118901</td><td>0.491932</td><td>0.118901</td></tr>\n", + "
<tr><th>156</th><td>0.773215</td><td>0.773215</td><td>0.773215</td><td>0.773215</td></tr>\n", + "
<tr><th>500</th><td>0.441087</td><td>0.441087</td><td>0.441087</td><td>0.441087</td></tr>\n", + "
<tr><th>325</th><td>0.095068</td><td>0.095068</td><td>0.095068</td><td>0.095068</td></tr>\n", + "
<tr><th>97</th><td>0.209842</td><td>0.209842</td><td>0.209842</td><td>0.487348</td></tr>\n", + "
<tr><th>905</th><td>0.117657</td><td>0.491084</td><td>0.117657</td><td>0.117657</td></tr>\n", + "
<tr><th>251</th><td>0.961305</td><td>0.961305</td><td>0.961305</td><td>0.961305</td></tr>\n", + "
</tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " number number_copy_1_percent number_copy_5_percent \\\n", + "366 0.425525 0.425525 0.425525 \n", + "145 0.246589 0.246589 0.246589 \n", + "538 0.503701 0.503701 0.503701 \n", + "256 0.118901 0.118901 0.491932 \n", + "156 0.773215 0.773215 0.773215 \n", + "500 0.441087 0.441087 0.441087 \n", + "325 0.095068 0.095068 0.095068 \n", + "97 0.209842 0.209842 0.209842 \n", + "905 0.117657 0.491084 0.117657 \n", + "251 0.961305 0.961305 0.961305 \n", + "\n", + " number_copy_10_percent \n", + "366 0.425525 \n", + "145 0.246589 \n", + "538 0.503701 \n", + "256 0.118901 \n", + "156 0.773215 \n", + "500 0.441087 \n", + "325 0.095068 \n", + "97 0.487348 \n", + "905 0.117657 \n", + "251 0.961305 " + ] + }, + "execution_count": 124, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_number2.sample(10)" + ] + }, + { + "cell_type": "markdown", + "id": "88d89795-0ae9-4f37-89cd-b24d36658588", + "metadata": {}, + "source": [ + "#### Create a list of differences - MEAN" + ] + }, + { + "cell_type": "code", + "execution_count": 125, + "id": "530979d5-52c4-473d-95f3-754c460a7ab6", + "metadata": {}, + "outputs": [], + "source": [ + "# build the lists of absolute differences between imputed and original values\n", + "\n", + "number_diff_1_mean = []\n", + "number_diff_5_mean = []\n", + "number_diff_10_mean = []\n", + "count = 0\n", + "\n", + "for i in number_1_idx:\n", + " count +=1\n", + " diff1 = abs(df_number2['number_copy_1_percent'][i] - df_number2['number'][i])\n", + " number_diff_1_mean.append(diff1)\n", + " \n", + "\n", + "for i in number_5_idx:\n", + " diff5 = abs(df_number2['number_copy_5_percent'][i] - df_number2['number'][i])\n", + " number_diff_5_mean.append(diff5)\n", + "\n", + "for i in number_10_idx:\n", + " diff10 = abs(df_number2['number_copy_10_percent'][i] - df_number2['number'][i])\n", + " number_diff_10_mean.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 126, + "id": 
"28dd2494-0175-431e-b4b7-09ee4af1f6a0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(number_diff_1_mean))\n", + "print(len(number_diff_5_mean))\n", + "print(len(number_diff_10_mean))" + ] + }, + { + "cell_type": "markdown", + "id": "4e90251e-4c0a-4e2d-82b1-8764374aed1c", + "metadata": {}, + "source": [ + "### Calculate the mean and variance of the list of differences - MEAN Impute" + ] + }, + { + "cell_type": "code", + "execution_count": 127, + "id": "682bd76e-4875-4b4d-b90b-91d8a6e492ae", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 0.269368727544059 and varience 1% is 0.018130331928686818\n", + "The mean of 5% is 0.18484105170274112 and varience 5% is 0.014920933643125705\n", + "The mean of 10% is 0.18484105170274112 and varience 10% is 0.020023889816061954\n" + ] + } + ], + "source": [ + "m1 = sum(number_diff_1_mean) / len(number_diff_1_mean)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res1 = sum((xi - m1) ** 2 for xi in number_diff_1_mean) / len(number_diff_1_mean)\n", + "\n", + "m5 = sum(number_diff_5_mean) / len(number_diff_5_mean)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res5 = sum((xii - m5) ** 2 for xii in number_diff_5_mean) / len(number_diff_5_mean)\n", + "\n", + "\n", + "m10 = sum(number_diff_10_mean) / len(number_diff_10_mean)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in number_diff_10_mean) / len(number_diff_10_mean)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 128, + "id": 
"1f41880d-3e7d-48c9-8744-7e47ccae3c17", + "metadata": {}, + "outputs": [], + "source": [ + "df_MI_number = pd.DataFrame.from_dict({'1%_number': [m1, var_res1],\n", + " '5%_number': [m5, var_res5],\n", + " '10%_number': [m10, var_res10]}, orient='index')\n", + "df_MI_number.columns=['diff. list Mean(MI)', 'diff. list Var.(MI)']" + ] + }, + { + "cell_type": "markdown", + "id": "ec64b079-db97-429c-ae3a-519eec91db3f", + "metadata": {}, + "source": [ + "## KNN and MEAN columns side by side" + ] + }, + { + "cell_type": "code", + "execution_count": 129, + "id": "d74b0e73-e3f0-4107-806d-c5d5a50aab9a", + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import display_html\n", + "from itertools import chain,cycle\n", + "def display_side_by_side(*args,titles=cycle([''])):\n", + " html_str=''\n", + " for df,title in zip(args, chain(titles,cycle(['
'])) ):\n", + " html_str+=''\n", + " html_str+=f'

{title}

'\n", + " html_str+=df.to_html().replace('table','table style=\"display:inline\"')\n", + " html_str+=''\n", + " display_html(html_str,raw=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 130, + "id": "747a487f-cbc4-467a-9bc7-b0856dbb6576", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 130, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from IPython.display import display, HTML\n", + "\n", + "CSS = \"\"\"\n", + ".output {\n", + " flex-direction: row;\n", + "}\n", + "\"\"\"\n", + "\n", + "HTML(''.format(CSS))" + ] + }, + { + "cell_type": "code", + "execution_count": 131, + "id": "d24551d1-cd58-4a41-8262-873fe5034272", + "metadata": {}, + "outputs": [], + "source": [ + "# https://github.com/epmoyer/ipy_table/issues/24\n", + "\n", + "from IPython.core.display import HTML\n", + "\n", + "def multi_table(table_list):\n", + " ''' Acceps a list of IpyTable objects and returns a table which contains each IpyTable in a cell\n", + " '''\n", + " return HTML(\n", + " '' + \n", + " ''.join(['' for table in table_list]) +\n", + " '
' + table._repr_html_() + '
'\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 132, + "id": "8a8daa30-3abf-4315-ae58-f9171ff000d5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[124, 257, 309, 313, 405]\n" + ] + } + ], + "source": [ + "print(number_1_idx[:5])" + ] + }, + { + "cell_type": "code", + "execution_count": 133, + "id": "da6b1646-2417-42b7-bc8f-d3b0be85c61b", + "metadata": {}, + "outputs": [], + "source": [ + "compare_1 = imputed_number_df.loc[:, [\"number\", \"number_copy_1_percent\"]]\n", + "compare_5 = imputed_number_df.loc[:, [\"number\", \"number_copy_5_percent\"]]\n", + "compare_10 = imputed_number_df.loc[:, [\"number\", \"number_copy_10_percent\"]]" + ] + }, + { + "cell_type": "code", + "execution_count": 134, + "id": "380b94cf-264f-4a41-bb1d-ac272354073f", + "metadata": {}, + "outputs": [], + "source": [ + "compare_1_df = compare_1.iloc[number_1_idx]\n", + "compare_5_df = compare_5.iloc[number_5_idx]\n", + "compare_10_df = compare_10.iloc[number_10_idx]" + ] + }, + { + "cell_type": "code", + "execution_count": 135, + "id": "e5b21e71-0ddd-4c60-b931-b384d65230dd", + "metadata": {}, + "outputs": [], + "source": [ + "compare_1_mean = df_number2.loc[:, [\"number\", \"number_copy_1_percent\"]]\n", + "compare_5_mean = df_number2.loc[:, [\"number\", \"number_copy_5_percent\"]]\n", + "compare_10_mean = df_number2.loc[:, [\"number\", \"number_copy_10_percent\"]]" + ] + }, + { + "cell_type": "code", + "execution_count": 136, + "id": "29be3554-8129-4f0c-bad6-1270b7c6c05b", + "metadata": {}, + "outputs": [], + "source": [ + "compare_1_mean_df = compare_1_mean.iloc[number_1_idx]\n", + "compare_5_mean_df = compare_5_mean.iloc[number_5_idx]\n", + "compare_10_mean_df = compare_10_mean.iloc[number_10_idx]" + ] + }, + { + "cell_type": "code", + "execution_count": 137, + "id": "27b96ecc-3566-48f5-bec5-9b073c575cb6", + "metadata": {}, + "outputs": [], + "source": [ + "# display_side_by_side(compare_1_df.head(), 
compare_1_mean_df.head(), titles=['number 1% KNN Impute','number 1% Mean Impute'])\n", + "# display_side_by_side(compare_5_df.head(), compare_5_mean_df.head(), titles=['number 5% KNN Impute','number 5% Mean Impute'])\n", + "# display_side_by_side(compare_10_df.head(), compare_10_mean_df.head(), titles=['number 10% KNN Impute','number 10% Mean Impute'])" + ] + }, + { + "cell_type": "markdown", + "id": "72a3bc3c-0f91-49ad-bf03-dc4b7ace265d", + "metadata": {}, + "source": [ + "#### **number 1% KNN Impute VS number 1% Mean Impute**" + ] + }, + { + "cell_type": "code", + "execution_count": 138, + "id": "6fd11f89-9f4b-49b3-b114-1ab3b461f180", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
 number  number_copy_1_percent\n", + "124  0.192990  0.192926\n", + "257  0.065602  0.066172\n", + "309  0.661447  0.663769\n", + "313  0.963951  0.962988\n", + "405  0.627460  0.627545\n", + "\n", + "
 number  number_copy_1_percent\n", + "124  0.192990  0.491084\n", + "257  0.065602  0.491084\n", + "309  0.661447  0.491084\n", + "313  0.963951  0.491084\n", + "405  0.627460  0.491084
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 138, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "multi_table([compare_1_df.head(), compare_1_mean_df.head()])" + ] + }, + { + "cell_type": "markdown", + "id": "e1fc9d1c-53ef-42d3-809b-d68051057e48", + "metadata": {}, + "source": [ + "#### **number 5% KNN Impute VS number 5% Mean Impute**" + ] + }, + { + "cell_type": "code", + "execution_count": 139, + "id": "a97c1530-2e50-48d2-a7e0-89fc70f648e5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
 number  number_copy_5_percent\n", + "54  0.440144  0.439307\n", + "59  0.189655  0.191045\n", + "72  0.411451  0.412386\n", + "78  0.205178  0.204306\n", + "107  0.323097  0.322044\n", + "\n", + "
 number  number_copy_5_percent\n", + "54  0.440144  0.491932\n", + "59  0.189655  0.491932\n", + "72  0.411451  0.491932\n", + "78  0.205178  0.491932\n", + "107  0.323097  0.491932
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 139, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "multi_table([compare_5_df.head(), compare_5_mean_df.head()])" + ] + }, + { + "cell_type": "markdown", + "id": "1e732ac9-faf7-4457-baef-ac9c4976598c", + "metadata": {}, + "source": [ + "#### **number 10% KNN Impute VS number 10% Mean Impute**" + ] + }, + { + "cell_type": "code", + "execution_count": 140, + "id": "f2d22e8f-5a0b-48c0-9150-a391d48e93b2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
 number  number_copy_10_percent\n", + "22  0.798188  0.798777\n", + "47  0.861454  0.861385\n", + "49  0.445108  0.446055\n", + "68  0.557468  0.557299\n", + "69  0.231172  0.230069\n", + "\n", + "
 number  number_copy_10_percent\n", + "22  0.798188  0.487348\n", + "47  0.861454  0.487348\n", + "49  0.445108  0.487348\n", + "68  0.557468  0.487348\n", + "69  0.231172  0.487348
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 140, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "multi_table([compare_10_df.head(), compare_10_mean_df.head()])" + ] + }, + { + "cell_type": "markdown", + "id": "cc817314-971f-4abf-a56e-9830a5cf0329", + "metadata": {}, + "source": [ + "# 1.2 Random Numbers dataset Results - KNN and MEAN" + ] + }, + { + "cell_type": "code", + "execution_count": 142, + "id": "1397844d-6757-471c-bd76-ff84d466b150", + "metadata": {}, + "outputs": [], + "source": [ + "results = pd.concat([df_knn_number, df_MI_number])" + ] + }, + { + "cell_type": "code", + "execution_count": 143, + "id": "51868cc7-20f3-499d-a76d-f06f99ea1841", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
 diff. list Mean(KNN)  diff. list Var.(KNN)  diff. list Mean(MI)  diff. list Var.(MI)\n", + "1%_number  0.000790  4.568702e-07  NaN  NaN\n", + "5%_number  0.000676  3.072444e-07  NaN  NaN\n", + "10%_number  0.000648  2.480609e-07  NaN  NaN\n", + "1%_number  NaN  NaN  0.269369  0.018130\n", + "5%_number  NaN  NaN  0.184841  0.014921\n", + "10%_number  NaN  NaN  0.231501  0.020024
" + ], + "text/plain": [ + " diff. list Mean(KNN) diff. list Var.(KNN) diff. list Mean(MI) \\\n", + "1%_number 0.000790 4.568702e-07 NaN \n", + "5%_number 0.000676 3.072444e-07 NaN \n", + "10%_number 0.000648 2.480609e-07 NaN \n", + "1%_number NaN NaN 0.269369 \n", + "5%_number NaN NaN 0.184841 \n", + "10%_number NaN NaN 0.231501 \n", + "\n", + " diff. list Var.(MI) \n", + "1%_number NaN \n", + "5%_number NaN \n", + "10%_number NaN \n", + "1%_number 0.018130 \n", + "5%_number 0.014921 \n", + "10%_number 0.020024 " + ] + }, + "execution_count": 143, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "results" + ] + }, + { + "cell_type": "code", + "execution_count": 144, + "id": "85deaebb-3a2b-4b52-bf80-ce31499a70d8", + "metadata": {}, + "outputs": [], + "source": [ + "results.to_csv('random_num_knn_mean_results.csv')" + ] + }, + { + "cell_type": "markdown", + "id": "08586561-e3a5-4d15-a1c0-b8d71731a84a", + "metadata": {}, + "source": [ + "# 2.1 Housing Dataset " + ] + }, + { + "cell_type": "code", + "execution_count": 361, + "id": "c05f4dd5-4cdc-4617-939a-2e22ec859af1", + "metadata": {}, + "outputs": [], + "source": [ + "housing_data = pd.read_csv('https://raw.githubusercontent.com/nikbearbrown/AI_Research_Group/main/Awesome-UCI-Datasets/Classification/House_Price_predication/train.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": 362, + "id": "8564d163-97ce-44da-8d3c-6f8cd9c1d0a1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
 Id  MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  LandContour  Utilities  ...  PoolArea  PoolQC  Fence  MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition  SalePrice\n", + "820  821  60  RL  72.0  7226  Pave  NaN  IR1  Lvl  AllPub  ...  0  NaN  NaN  NaN  0  6  2008  WD  Normal  183000\n", + "1390  1391  20  RL  70.0  9100  Pave  NaN  Reg  Lvl  AllPub  ...  0  NaN  NaN  NaN  0  9  2006  WD  Normal  235000\n", + "535  536  190  RL  70.0  7000  Pave  NaN  Reg  Lvl  AllPub  ...  0  NaN  NaN  NaN  0  1  2008  WD  Normal  107500\n", + "1236  1237  160  RL  36.0  2628  Pave  NaN  Reg  Lvl  AllPub  ...  0  NaN  NaN  NaN  0  6  2010  WD  Normal  175500\n", + "1337  1338  30  RM  153.0  4118  Pave  Grvl  IR1  Bnk  AllPub  ...  0  NaN  NaN  NaN  0  3  2006  WD  Normal  52500\n", + "674  675  20  RL  80.0  9200  Pave  NaN  Reg  Lvl  AllPub  ...  0  NaN  NaN  NaN  0  7  2008  WD  Normal  140000\n", + "604  605  20  RL  88.0  12803  Pave  NaN  IR1  Lvl  AllPub  ...  0  NaN  NaN  NaN  0  9  2008  WD  Normal  221000\n", + "605  606  60  RL  85.0  13600  Pave  NaN  Reg  Lvl  AllPub  ...  0  NaN  NaN  NaN  0  10  2009  WD  Normal  205000\n", + "1218  1219  50  RM  52.0  6240  Pave  NaN  Reg  Lvl  AllPub  ...  0  NaN  NaN  NaN  0  7  2006  WD  Normal  80500\n", + "882  883  60  RL  NaN  9636  Pave  NaN  IR1  Lvl  AllPub  ...  0  NaN  MnPrv  NaN  0  12  2009  WD  Normal  178000\n", + "\n", + "10 rows × 81 columns
" + ], + "text/plain": [ + " Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", + "820 821 60 RL 72.0 7226 Pave NaN IR1 \n", + "1390 1391 20 RL 70.0 9100 Pave NaN Reg \n", + "535 536 190 RL 70.0 7000 Pave NaN Reg \n", + "1236 1237 160 RL 36.0 2628 Pave NaN Reg \n", + "1337 1338 30 RM 153.0 4118 Pave Grvl IR1 \n", + "674 675 20 RL 80.0 9200 Pave NaN Reg \n", + "604 605 20 RL 88.0 12803 Pave NaN IR1 \n", + "605 606 60 RL 85.0 13600 Pave NaN Reg \n", + "1218 1219 50 RM 52.0 6240 Pave NaN Reg \n", + "882 883 60 RL NaN 9636 Pave NaN IR1 \n", + "\n", + " LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal \\\n", + "820 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "1390 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "535 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "1236 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "1337 Bnk AllPub ... 0 NaN NaN NaN 0 \n", + "674 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "604 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "605 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "1218 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "882 Lvl AllPub ... 
0 NaN MnPrv NaN 0 \n", + "\n", + " MoSold YrSold SaleType SaleCondition SalePrice \n", + "820 6 2008 WD Normal 183000 \n", + "1390 9 2006 WD Normal 235000 \n", + "535 1 2008 WD Normal 107500 \n", + "1236 6 2010 WD Normal 175500 \n", + "1337 3 2006 WD Normal 52500 \n", + "674 7 2008 WD Normal 140000 \n", + "604 9 2008 WD Normal 221000 \n", + "605 10 2009 WD Normal 205000 \n", + "1218 7 2006 WD Normal 80500 \n", + "882 12 2009 WD Normal 178000 \n", + "\n", + "[10 rows x 81 columns]" + ] + }, + "execution_count": 362, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data.sample(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 363, + "id": "bd81975c-0a21-414b-8e20-3564d35b9f9b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "663" + ] + }, + "execution_count": 363, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data['SalePrice'].nunique()" + ] + }, + { + "cell_type": "code", + "execution_count": 364, + "id": "67d1046e-a1ad-412e-a7e8-a0d51729cec7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1073" + ] + }, + "execution_count": 364, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data['LotArea'].nunique()" + ] + }, + { + "cell_type": "code", + "execution_count": 365, + "id": "64b05e52-72dc-4f7d-aca3-d043036b4d2f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "count 1460.000000\n", + "mean 180921.195890\n", + "std 79442.502883\n", + "min 34900.000000\n", + "25% 129975.000000\n", + "50% 163000.000000\n", + "75% 214000.000000\n", + "max 755000.000000\n", + "Name: SalePrice, dtype: float64" + ] + }, + "execution_count": 365, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data['SalePrice'].describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 366, + "id": "b7e9928c-4785-4ee1-8150-cd0fa1ef3325", + "metadata": {}, + "outputs": [ 
+ { + "data": { + "text/plain": [ + "count 1460.000000\n", + "mean 10516.828082\n", + "std 9981.264932\n", + "min 1300.000000\n", + "25% 7553.500000\n", + "50% 9478.500000\n", + "75% 11601.500000\n", + "max 215245.000000\n", + "Name: LotArea, dtype: float64" + ] + }, + "execution_count": 366, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data['LotArea'].describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 367, + "id": "20149f80-07dc-4eaa-8d0e-7de6612a7dce", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "Id Id 0.000000\n", + "MSSubClass MSSubClass 0.000000\n", + "MSZoning MSZoning 0.000000\n", + "LotFrontage LotFrontage 17.739726\n", + "LotArea LotArea 0.000000\n", + "Street Street 0.000000\n", + "Alley Alley 93.767123\n", + "LotShape LotShape 0.000000\n", + "LandContour LandContour 0.000000\n", + "Utilities Utilities 0.000000\n", + "LotConfig LotConfig 0.000000\n", + "LandSlope LandSlope 0.000000\n", + "Neighborhood Neighborhood 0.000000\n", + "Condition1 Condition1 0.000000\n", + "Condition2 Condition2 0.000000\n", + "BldgType BldgType 0.000000\n", + "HouseStyle HouseStyle 0.000000\n", + "OverallQual OverallQual 0.000000\n", + "OverallCond OverallCond 0.000000\n", + "YearBuilt YearBuilt 0.000000\n", + "YearRemodAdd YearRemodAdd 0.000000\n", + "RoofStyle RoofStyle 0.000000\n", + "RoofMatl RoofMatl 0.000000\n", + "Exterior1st Exterior1st 0.000000\n", + "Exterior2nd Exterior2nd 0.000000\n", + "MasVnrType MasVnrType 0.547945\n", + "MasVnrArea MasVnrArea 0.547945\n", + "ExterQual ExterQual 0.000000\n", + "ExterCond ExterCond 0.000000\n", + "Foundation Foundation 0.000000\n", + "BsmtQual BsmtQual 2.534247\n", + "BsmtCond BsmtCond 2.534247\n", + "BsmtExposure BsmtExposure 2.602740\n", + "BsmtFinType1 BsmtFinType1 2.534247\n", + "BsmtFinSF1 BsmtFinSF1 0.000000\n", + "BsmtFinType2 BsmtFinType2 2.602740\n", + "BsmtFinSF2 BsmtFinSF2 
0.000000\n", + "BsmtUnfSF BsmtUnfSF 0.000000\n", + "TotalBsmtSF TotalBsmtSF 0.000000\n", + "Heating Heating 0.000000\n", + "HeatingQC HeatingQC 0.000000\n", + "CentralAir CentralAir 0.000000\n", + "Electrical Electrical 0.068493\n", + "1stFlrSF 1stFlrSF 0.000000\n", + "2ndFlrSF 2ndFlrSF 0.000000\n", + "LowQualFinSF LowQualFinSF 0.000000\n", + "GrLivArea GrLivArea 0.000000\n", + "BsmtFullBath BsmtFullBath 0.000000\n", + "BsmtHalfBath BsmtHalfBath 0.000000\n", + "FullBath FullBath 0.000000\n", + "HalfBath HalfBath 0.000000\n", + "BedroomAbvGr BedroomAbvGr 0.000000\n", + "KitchenAbvGr KitchenAbvGr 0.000000\n", + "KitchenQual KitchenQual 0.000000\n", + "TotRmsAbvGrd TotRmsAbvGrd 0.000000\n", + "Functional Functional 0.000000\n", + "Fireplaces Fireplaces 0.000000\n", + "FireplaceQu FireplaceQu 47.260274\n", + "GarageType GarageType 5.547945\n", + "GarageYrBlt GarageYrBlt 5.547945\n", + "GarageFinish GarageFinish 5.547945\n", + "GarageCars GarageCars 0.000000\n", + "GarageArea GarageArea 0.000000\n", + "GarageQual GarageQual 5.547945\n", + "GarageCond GarageCond 5.547945\n", + "PavedDrive PavedDrive 0.000000\n", + "WoodDeckSF WoodDeckSF 0.000000\n", + "OpenPorchSF OpenPorchSF 0.000000\n", + "EnclosedPorch EnclosedPorch 0.000000\n", + "3SsnPorch 3SsnPorch 0.000000\n", + "ScreenPorch ScreenPorch 0.000000\n", + "PoolArea PoolArea 0.000000\n", + "PoolQC PoolQC 99.520548\n", + "Fence Fence 80.753425\n", + "MiscFeature MiscFeature 96.301370\n", + "MiscVal MiscVal 0.000000\n", + "MoSold MoSold 0.000000\n", + "YrSold YrSold 0.000000\n", + "SaleType SaleType 0.000000\n", + "SaleCondition SaleCondition 0.000000\n", + "SalePrice SalePrice 0.000000\n" + ] + } + ], + "source": [ + "pd.set_option('display.max_rows', None)\n", + "print(get_percent_missing(housing_data))" + ] + }, + { + "cell_type": "markdown", + "id": "c8eb3ee3-085d-4b41-9a5f-c83a3805f870", + "metadata": {}, + "source": [ + "#### Using the SalePrice column for the KNN and MEAN imputation task" + ] + }, + { + "cell_type": 
"markdown", + "id": "451c79fb-17ba-40ac-8f0b-87a8b2ec4837", + "metadata": {}, + "source": [ + "#### Non Scaled dataframe Sale Price - take first 1000 rows" + ] + }, + { + "cell_type": "code", + "execution_count": 368, + "id": "9cc1f97f-1b24-4570-8f6a-30426bd79269", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
 SalePrice  sp_copy_1_percent  sp_copy_5_percent  sp_copy_10_percent\n", + "0  208500  208500  208500  208500\n", + "1  181500  181500  181500  181500\n", + "2  223500  223500  223500  223500\n", + "3  140000  140000  140000  140000\n", + "4  250000  250000  250000  250000
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "0 208500 208500 208500 208500\n", + "1 181500 181500 181500 181500\n", + "2 223500 223500 223500 223500\n", + "3 140000 140000 140000 140000\n", + "4 250000 250000 250000 250000" + ] + }, + "execution_count": 368, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Take the first 1000 rows as an independent copy so the added columns do not write into a view of housing_data\n", + "df_saleprice = housing_data[['SalePrice']].iloc[:1000].copy()\n", + "df_saleprice['sp_copy_1_percent'] = df_saleprice['SalePrice']\n", + "df_saleprice['sp_copy_5_percent'] = df_saleprice['SalePrice']\n", + "df_saleprice['sp_copy_10_percent'] = df_saleprice['SalePrice']\n", + "df_saleprice.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 369, + "id": "f462f065-9f37-44f1-a22e-92e610dae2e9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1000" + ] + }, + "execution_count": 369, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(df_saleprice)" + ] + }, + { + "cell_type": "markdown", + "id": "03407bbd-f8a7-4f6c-a7c3-64a865ed3f7e", + "metadata": {}, + "source": [ + "#### Scaled Dataframe SalePrice - take first 1000 rows" + ] + }, + { + "cell_type": "code", + "execution_count": 370, + "id": "e461b1ef-df2c-410f-aea8-abe954fa9afd", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
 SalePrice  sp_copy_1_percent  sp_copy_5_percent  sp_copy_10_percent\n", + "0  0.241078  0.241078  0.241078  0.241078\n", + "1  0.203583  0.203583  0.203583  0.203583\n", + "2  0.261908  0.261908  0.261908  0.261908\n", + "3  0.145952  0.145952  0.145952  0.145952\n", + "4  0.298709  0.298709  0.298709  0.298709
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "0 0.241078 0.241078 0.241078 0.241078\n", + "1 0.203583 0.203583 0.203583 0.203583\n", + "2 0.261908 0.261908 0.261908 0.261908\n", + "3 0.145952 0.145952 0.145952 0.145952\n", + "4 0.298709 0.298709 0.298709 0.298709" + ] + }, + "execution_count": 370, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "scaler = MinMaxScaler()\n", + "df_saleprice_scaled = df_saleprice.copy(deep=True)\n", + "df_saleprice_scaled = pd.DataFrame(scaler.fit_transform(df_saleprice_scaled), columns = df_saleprice_scaled.columns)\n", + "df_saleprice_scaled.head()" + ] + }, + { + "cell_type": "markdown", + "id": "a66683c4-f66a-4aa1-ab8a-f28087b60b6c", + "metadata": {}, + "source": [ + "#### Check % missing values in this dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 371, + "id": "0075fa0f-4b82-4089-ab81-e5282497c4a3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice))" + ] + }, + { + "cell_type": "markdown", + "id": "619ef99f-55c0-422c-aaa8-73cd71fcf2fb", + "metadata": {}, + "source": [ + "#### Create 1%, 5% and 10% missing data" + ] + }, + { + "cell_type": "code", + "execution_count": 372, + "id": "82df5098-4176-4fba-922f-ca84c0466f2a", + "metadata": {}, + "outputs": [], + "source": [ + "create_missing(df_saleprice, 0.01, 'sp_copy_1_percent')\n", + "create_missing(df_saleprice, 0.05, 'sp_copy_5_percent')\n", + "create_missing(df_saleprice, 0.1, 'sp_copy_10_percent')" + ] + }, + { + "cell_type": "code", + "execution_count": 373, + "id": "0e90ae04-cd10-4507-a851-c187010f0be0", + "metadata": {}, + "outputs": [], + "source": [ 
+ "create_missing(df_saleprice_scaled, 0.01, 'sp_copy_1_percent')\n", + "create_missing(df_saleprice_scaled, 0.05, 'sp_copy_5_percent')\n", + "create_missing(df_saleprice_scaled, 0.1, 'sp_copy_10_percent')" + ] + }, + { + "cell_type": "markdown", + "id": "a8237a82-5a33-4ce9-b4c7-a48ede4f5fef", + "metadata": {}, + "source": [ + "#### Check missing values in the unscaled and scaled dataframes" + ] + }, + { + "cell_type": "code", + "execution_count": 374, + "id": "2794306d-89c7-4518-8979-9edb3d9441b1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 1.0\n", + "sp_copy_5_percent sp_copy_5_percent 5.0\n", + "sp_copy_10_percent sp_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice))" + ] + }, + { + "cell_type": "code", + "execution_count": 375, + "id": "8351dbe2-b388-451d-9238-52c4ccabd425", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 1.0\n", + "sp_copy_5_percent sp_copy_5_percent 5.0\n", + "sp_copy_10_percent sp_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice_scaled))" + ] + }, + { + "cell_type": "code", + "execution_count": 376, + "id": "b11b093f-110b-4ef3-9d00-ac4fed45a956", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "10" + ] + }, + "execution_count": 376, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_saleprice['sp_copy_1_percent'].isna().sum()" + ] + }, + { + "cell_type": "markdown", + "id": "360e0010-e085-435c-8902-80c6a7ea78be", + "metadata": {}, + "source": [ + "#### Store indices of missing values" + ] + }, + { + "cell_type": "code", + "execution_count": 377, + "id": "e546096c-ce35-448e-aa97-0943d3535a87", + 
"metadata": {}, + "outputs": [], + "source": [ + "# Store the positional indices of the NaN values in each column\n", + "# These come from df_saleprice; they apply to df_saleprice_scaled only if create_missing masked the same rows in both dataframes\n", + "sp_1_idx = list(np.where(df_saleprice['sp_copy_1_percent'].isna())[0])\n", + "sp_5_idx = list(np.where(df_saleprice['sp_copy_5_percent'].isna())[0])\n", + "sp_10_idx = list(np.where(df_saleprice['sp_copy_10_percent'].isna())[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 378, + "id": "d409e2a5-b3a9-4ae1-9b17-88b7c642692d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_1_idx))\n", + "print(len(sp_5_idx))\n", + "print(len(sp_10_idx))" + ] + }, + { + "cell_type": "code", + "execution_count": 379, + "id": "5839460a-e736-42e9-9a13-d5bab5683115", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Length of sp_1_idx is 10 and it contains 1.0% of total data in column | Total rows: 1000\n", + "Length of sp_5_idx is 50 and it contains 5.0% of total data in column | Total rows: 1000\n", + "Length of sp_10_idx is 100 and it contains 10.0% of total data in column | Total rows: 1000\n" + ] + } + ], + "source": [ + "print(f\"Length of sp_1_idx is {len(sp_1_idx)} and it contains {(len(sp_1_idx)/len(df_saleprice['sp_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_1_percent'])}\")\n", + "print(f\"Length of sp_5_idx is {len(sp_5_idx)} and it contains {(len(sp_5_idx)/len(df_saleprice['sp_copy_5_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_5_percent'])}\")\n", + "print(f\"Length of sp_10_idx is {len(sp_10_idx)} and it contains {(len(sp_10_idx)/len(df_saleprice['sp_copy_10_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_10_percent'])}\")" + ] + }, + { + "cell_type": "markdown", + "id": "c1464c79-c0a9-4640-92dd-f0d5131634ab", + "metadata": {}, + "source": [ + "### Apply KNN imputation to the df_saleprice and df_saleprice_scaled dataframes" + ] + }, + { + "cell_type": "code", + "execution_count": 380, + "id": "08fa2436-ffb8-4b5d-a7a1-9e2d63b14562", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice1 = df_saleprice.copy(deep=True)\n", + "imputer = KNNImputer(n_neighbors=5)\n", + "imputed_saleprice_df = pd.DataFrame(imputer.fit_transform(df_saleprice1), columns = df_saleprice1.columns)" + ] + }, + { + "cell_type": "code", + "execution_count": 381, + "id": "205c7a96-3f1c-42a4-91de-f22f15ce9cb2", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice_scaled1 = df_saleprice_scaled.copy(deep=True)\n", + "imputer = KNNImputer(n_neighbors=5)\n", + "imputed_saleprice_scaled_df = pd.DataFrame(imputer.fit_transform(df_saleprice_scaled1), columns = df_saleprice_scaled1.columns)" + ] + }, + { + "cell_type": "code", + "execution_count": 382, + "id": "a482f58d-73b6-423c-b97a-140884830a0f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SalePricesp_copy_1_percentsp_copy_5_percentsp_copy_10_percent
0208500.0208500.0208500.0208500.0
1181500.0181500.0181500.0181500.0
2223500.0223500.0223500.0223500.0
3140000.0140000.0140000.0140000.0
4250000.0250000.0250000.0250000.0
\n", + "
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "0 208500.0 208500.0 208500.0 208500.0\n", + "1 181500.0 181500.0 181500.0 181500.0\n", + "2 223500.0 223500.0 223500.0 223500.0\n", + "3 140000.0 140000.0 140000.0 140000.0\n", + "4 250000.0 250000.0 250000.0 250000.0" + ] + }, + "execution_count": 382, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "imputed_saleprice_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 383, + "id": "11f8f5ff-f06d-4ec2-a4e3-1324e807a537", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SalePricesp_copy_1_percentsp_copy_5_percentsp_copy_10_percent
00.2410780.2410780.2408550.241078
10.2035830.2035830.2035830.203583
20.2619080.2619080.2619080.261908
30.1459520.1459520.1459520.145952
40.2987090.2987090.2987090.298709
\n", + "
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "0 0.241078 0.241078 0.240855 0.241078\n", + "1 0.203583 0.203583 0.203583 0.203583\n", + "2 0.261908 0.261908 0.261908 0.261908\n", + "3 0.145952 0.145952 0.145952 0.145952\n", + "4 0.298709 0.298709 0.298709 0.298709" + ] + }, + "execution_count": 383, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "imputed_saleprice_scaled_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "d9fd7fa1-4ce0-43be-9955-55ef759d930b", + "metadata": {}, + "source": [ + "#### Check % missing in saleprice and saleprice_scaled DF" + ] + }, + { + "cell_type": "code", + "execution_count": 384, + "id": "9ed0d36a-9584-4e3b-9201-2ac36827bce9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(imputed_saleprice_df))" + ] + }, + { + "cell_type": "code", + "execution_count": 385, + "id": "7c842fce-bbd5-4c2c-bb1a-db5df92f6315", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(imputed_saleprice_scaled_df))" + ] + }, + { + "cell_type": "markdown", + "id": "ac47abb1-df5f-4686-bc67-6617140c008c", + "metadata": {}, + "source": [ + "#### Store the list of disfferences between Org. 
and Imputed Value" + ] + }, + { + "cell_type": "code", + "execution_count": 386, + "id": "99e04554-568d-4efa-a110-768b50dfaee6", + "metadata": {}, + "outputs": [], + "source": [ + "# Absolute differences between imputed and original values (KNN, unscaled)\n", + "\n", + "sp_diff_1 = []\n", + "sp_diff_5 = []\n", + "sp_diff_10 = []\n", + "count = 0\n", + "\n", + "for i in sp_1_idx:\n", + " count +=1\n", + " diff1 = abs(imputed_saleprice_df['sp_copy_1_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", + " sp_diff_1.append(diff1)\n", + " \n", + "\n", + "for i in sp_5_idx:\n", + " diff5 = abs(imputed_saleprice_df['sp_copy_5_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", + " sp_diff_5.append(diff5)\n", + "\n", + "for i in sp_10_idx:\n", + " diff10 = abs(imputed_saleprice_df['sp_copy_10_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", + " sp_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 387, + "id": "92204f8a-497c-470d-a770-59165d226cc9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_diff_1))\n", + "print(len(sp_diff_5))\n", + "print(len(sp_diff_10))" + ] + }, + { + "cell_type": "code", + "execution_count": 388, + "id": "b8875fff-0289-4dd9-92c1-78dc9b730d22", + "metadata": {}, + "outputs": [], + "source": [ + "# Absolute differences between imputed and original values (KNN, scaled)\n", + "\n", + "sp_scaled_diff_1 = []\n", + "sp_scaled_diff_5 = []\n", + "sp_scaled_diff_10 = []\n", + "count = 0\n", + "\n", + "for i in sp_1_idx:\n", + " count +=1\n", + " diff1 = abs(imputed_saleprice_scaled_df['sp_copy_1_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", + " sp_scaled_diff_1.append(diff1)\n", + " \n", + "\n", + "for i in sp_5_idx:\n", + " diff5 = abs(imputed_saleprice_scaled_df['sp_copy_5_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", + " sp_scaled_diff_5.append(diff5)\n", + "\n", + "for i 
in sp_10_idx:\n", + " diff10 = abs(imputed_saleprice_scaled_df['sp_copy_10_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", + " sp_scaled_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 389, + "id": "40192344-79a4-444c-a12a-2201dc5aa0c1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_scaled_diff_1))\n", + "print(len(sp_scaled_diff_5))\n", + "print(len(sp_scaled_diff_10))" + ] + }, + { + "cell_type": "code", + "execution_count": 390, + "id": "a95bd45c-8a2f-4159-8306-399ec18a4c0f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[0.0, 0.0, 0.0, 0.0, 0.0]" + ] + }, + "execution_count": 390, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sp_scaled_diff_1[:5]" + ] + }, + { + "cell_type": "code", + "execution_count": 391, + "id": "0f73d420-8842-4062-ae17-158a0a25e169", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[10.0, 20.0, 80.0, 220.0, 0.0]" + ] + }, + "execution_count": 391, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sp_diff_1[:5]" + ] + }, + { + "cell_type": "markdown", + "id": "a40fd400-913b-4011-b0b9-dd3ca0d5827a", + "metadata": {}, + "source": [ + "#### Calculate the mean and var of list of diff. 
KNN - SalePrice" + ] + }, + { + "cell_type": "code", + "execution_count": 392, + "id": "80267827-7f73-49ff-b200-27cdb2963756", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 170.0 and variance 1% is 42400.0\n", + "The mean of 5% is 444.9439999999997 and variance 5% is 2554554.1584639903\n", + "The mean of 10% is 564.784 and variance 10% is 6304766.8341439795\n" + ] + } + ], + "source": [ + "m1 = sum(sp_diff_1) / len(sp_diff_1)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res1 = sum((xi - m1) ** 2 for xi in sp_diff_1) / len(sp_diff_1)\n", + "\n", + "m5 = sum(sp_diff_5) / len(sp_diff_5)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res5 = sum((xii - m5) ** 2 for xii in sp_diff_5) / len(sp_diff_5)\n", + "\n", + "\n", + "m10 = sum(sp_diff_10) / len(sp_diff_10)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_diff_10) / len(sp_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 393, + "id": "358545ff-2fcf-4c99-9049-4eaf6dd110bd", + "metadata": {}, + "outputs": [], + "source": [ + "df_knn_saleprice = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", + " '5%_saleprice': [m5, var_res5],\n", + " '10%_saleprice': [m10, var_res10]}, orient='index')\n", + "df_knn_saleprice.columns=['diff. list Mean(KNN)', 'diff. list Var.(KNN)']" + ] + }, + { + "cell_type": "code", + "execution_count": 394, + "id": "3714c8f9-58db-40a7-b5a2-6bb7e788b734", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
diff. list Mean(KNN)diff. list Var.(KNN)
1%_saleprice170.0004.240000e+04
5%_saleprice444.9442.554554e+06
10%_saleprice564.7846.304767e+06
\n", + "
" + ], + "text/plain": [ + " diff. list Mean(KNN) diff. list Var.(KNN)\n", + "1%_saleprice 170.000 4.240000e+04\n", + "5%_saleprice 444.944 2.554554e+06\n", + "10%_saleprice 564.784 6.304767e+06" + ] + }, + "execution_count": 394, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_knn_saleprice" + ] + }, + { + "cell_type": "markdown", + "id": "fd7608a8-c5fb-425c-a340-af01801ee349", + "metadata": {}, + "source": [ + "#### Calculate the mean and var of list of diff. KNN - SalePrice scaled" + ] + }, + { + "cell_type": "code", + "execution_count": 395, + "id": "bb03017f-3d91-48d9-8ebf-7cb5c25fadc3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 0.0 and varience 1% is 0.0\n", + "The mean of 5% is 2.6301902513541363e-05 and varience 5% is 2.134349753649814e-08\n", + "The mean of 10% is 2.6301902513541363e-05 and varience 10% is 1.417383473391258e-08\n" + ] + } + ], + "source": [ + "m1 = sum(sp_scaled_diff_1) / len(sp_scaled_diff_1)\n", + "\n", + "# calculate variance using a list comprehension\n", + "var_res1 = sum((xi - m1) ** 2 for xi in sp_scaled_diff_1) / len(sp_scaled_diff_1)\n", + "\n", + "m5 = sum(sp_scaled_diff_5) / len(sp_scaled_diff_5)\n", + "\n", + "# calculate variance using a list comprehension\n", + "var_res5 = sum((xii - m5) ** 2 for xii in sp_scaled_diff_5) / len(sp_scaled_diff_5)\n", + "\n", + "\n", + "m10 = sum(sp_scaled_diff_10) / len(sp_scaled_diff_10)\n", + "\n", + "# calculate variance using a list comprehension\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_scaled_diff_10) / len(sp_scaled_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and varience 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and varience 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m5} and varience 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 396, + "id": "290d8db2-c9f4-4028-ab44-ad68c9e7b3c5", + 
"metadata": {}, + "outputs": [], + "source": [ + "df_knn_saleprice_scaled = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", + " '5%_saleprice': [m5, var_res5],\n", + " '10%_saleprice': [m10, var_res10]}, orient='index')\n", + "df_knn_saleprice_scaled.columns=['diff. list Mean(KNN) scaled', 'diff. list Var.(KNN) scaled']" + ] + }, + { + "cell_type": "code", + "execution_count": 397, + "id": "89347fd7-d87d-42bb-b375-a75417c395de", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
diff. list Mean(KNN) scaleddiff. list Var.(KNN) scaled
1%_saleprice0.0000000.000000e+00
5%_saleprice0.0000262.134350e-08
10%_saleprice0.0000321.417383e-08
\n", + "
" + ], + "text/plain": [ + " diff. list Mean(KNN) scaled diff. list Var.(KNN) scaled\n", + "1%_saleprice 0.000000 0.000000e+00\n", + "5%_saleprice 0.000026 2.134350e-08\n", + "10%_saleprice 0.000032 1.417383e-08" + ] + }, + "execution_count": 397, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_knn_saleprice_scaled" + ] + }, + { + "cell_type": "markdown", + "id": "c984dc69-f85f-4f1b-8c94-4afb48c1c8db", + "metadata": {}, + "source": [ + "### Perform MEAN imputation" + ] + }, + { + "cell_type": "code", + "execution_count": 398, + "id": "008bc14f-45e7-42d8-b843-2fee7bcf26c2", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice2 = df_saleprice.copy(deep=True)\n", + "df_saleprice_scaled2 = df_saleprice_scaled.copy(deep=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 399, + "id": "bd71dc1a-f137-46ed-bf2b-f3d87fd4b6a0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 1.0\n", + "sp_copy_5_percent sp_copy_5_percent 5.0\n", + "sp_copy_10_percent sp_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice2))" + ] + }, + { + "cell_type": "code", + "execution_count": 400, + "id": "46237cfd-6361-466f-b66f-32f5940149d6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 1.0\n", + "sp_copy_5_percent sp_copy_5_percent 5.0\n", + "sp_copy_10_percent sp_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice_scaled2))" + ] + }, + { + "cell_type": "markdown", + "id": "64465299-5620-47b9-a28d-afb5494f279e", + "metadata": {}, + "source": [ + "#### Impute Mean values in missing for saleprice and saleprice_scaled" + ] + }, + { + "cell_type": "code", + 
"execution_count": 401, + "id": "28cf6b75-eebf-4758-94ec-4b3536f2c659", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice2['sp_copy_1_percent'] = df_saleprice2['sp_copy_1_percent'].fillna(df_saleprice2['sp_copy_1_percent'].mean())\n", + "df_saleprice2['sp_copy_5_percent'] = df_saleprice2['sp_copy_5_percent'].fillna(df_saleprice2['sp_copy_5_percent'].mean())\n", + "df_saleprice2['sp_copy_10_percent'] = df_saleprice2['sp_copy_10_percent'].fillna(df_saleprice2['sp_copy_10_percent'].mean())" + ] + }, + { + "cell_type": "code", + "execution_count": 402, + "id": "2409dd8c-3cd0-4742-b0ac-14dea1fdb504", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice_scaled2['sp_copy_1_percent'] = df_saleprice_scaled2['sp_copy_1_percent'].fillna(df_saleprice_scaled2['sp_copy_1_percent'].mean())\n", + "df_saleprice_scaled2['sp_copy_5_percent'] = df_saleprice_scaled2['sp_copy_5_percent'].fillna(df_saleprice_scaled2['sp_copy_5_percent'].mean())\n", + "df_saleprice_scaled2['sp_copy_10_percent'] = df_saleprice_scaled2['sp_copy_10_percent'].fillna(df_saleprice_scaled2['sp_copy_10_percent'].mean())" + ] + }, + { + "cell_type": "markdown", + "id": "62377754-b682-45e5-8faa-1a4a186bd3c7", + "metadata": {}, + "source": [ + "#### After MEAN imputation - Saleprice and saleprice scaled" + ] + }, + { + "cell_type": "code", + "execution_count": 403, + "id": "6c448556-55f4-4685-aed2-6b67d5ad8a2a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice2))" + ] + }, + { + "cell_type": "code", + "execution_count": 404, + "id": "d9775fbf-7a72-4352-b446-488e9d25b6a2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 
column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice_scaled2))" + ] + }, + { + "cell_type": "code", + "execution_count": 407, + "id": "136f87e6-a4af-4229-b36a-695f712deee5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SalePricesp_copy_1_percentsp_copy_5_percentsp_copy_10_percent
571120000120000.0120000.000000182343.817778
2223500223500.0223500.000000223500.000000
313375000375000.0375000.000000375000.000000
377340000340000.0182457.342105182343.817778
987395192395192.0395192.000000395192.000000
\n", + "
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "571 120000 120000.0 120000.000000 182343.817778\n", + "2 223500 223500.0 223500.000000 223500.000000\n", + "313 375000 375000.0 375000.000000 375000.000000\n", + "377 340000 340000.0 182457.342105 182343.817778\n", + "987 395192 395192.0 395192.000000 395192.000000" + ] + }, + "execution_count": 407, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_saleprice2.sample(5)" + ] + }, + { + "cell_type": "code", + "execution_count": 409, + "id": "784cb61c-78f8-4b31-b709-379c50024dca", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SalePricesp_copy_1_percentsp_copy_5_percentsp_copy_10_percent
2160.2431610.2431610.2431610.243161
10.2035830.2035830.2035830.203583
5750.1160950.1160950.1160950.116095
3970.1869180.1869180.1869180.205253
7030.1459520.1459520.1459520.145952
\n", + "
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "216 0.243161 0.243161 0.243161 0.243161\n", + "1 0.203583 0.203583 0.203583 0.203583\n", + "575 0.116095 0.116095 0.116095 0.116095\n", + "397 0.186918 0.186918 0.186918 0.205253\n", + "703 0.145952 0.145952 0.145952 0.145952" + ] + }, + "execution_count": 409, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_saleprice_scaled2.sample(5)" + ] + }, + { + "cell_type": "markdown", + "id": "33c1f3b7-5afc-45cb-8b43-9682ec87156d", + "metadata": {}, + "source": [ + "#### Create List of differences for saleprice and saleprice_scaled Dataframes" + ] + }, + { + "cell_type": "code", + "execution_count": 410, + "id": "d2faf410-f83e-4ccb-89d4-e6f8c7adffbb", + "metadata": {}, + "outputs": [], + "source": [ + "# create list of difference bwtween imputed and orginal value\n", + "\n", + "sp_mean_diff_1 = []\n", + "sp_mean_diff_5 = []\n", + "sp_mean_diff_10 = []\n", + "count = 0\n", + "\n", + "for i in sp_1_idx:\n", + " count +=1\n", + " diff1 = abs(df_saleprice2['sp_copy_1_percent'][i] - df_saleprice2['SalePrice'][i])\n", + " sp_mean_diff_1.append(diff1)\n", + " \n", + "\n", + "for i in sp_5_idx:\n", + " diff5 = abs(df_saleprice2['sp_copy_5_percent'][i] - df_saleprice2['SalePrice'][i])\n", + " sp_mean_diff_5.append(diff5)\n", + "\n", + "for i in sp_10_idx:\n", + " diff10 = abs(df_saleprice2['sp_copy_10_percent'][i] - df_saleprice2['SalePrice'][i])\n", + " sp_mean_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 411, + "id": "789b07c5-530a-4111-8c97-f5297f7da5e4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_mean_diff_1))\n", + "print(len(sp_mean_diff_5))\n", + "print(len(sp_mean_diff_10))" + ] + }, + { + "cell_type": "code", + "execution_count": 412, + "id": "4fec222c-2420-41af-9e2a-d9773e1d6259", + 
"metadata": {}, + "outputs": [], + "source": [ + "# create list of difference bwtween imputed and orginal value\n", + "\n", + "sp_scaled_mean_diff_1 = []\n", + "sp_scaled_mean_diff_5 = []\n", + "sp_scaled_mean_diff_10 = []\n", + "count = 0\n", + "\n", + "for i in sp_1_idx:\n", + " count +=1\n", + " diff1 = abs(df_saleprice_scaled2['sp_copy_1_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", + " sp_scaled_mean_diff_1.append(diff1)\n", + " \n", + "\n", + "for i in sp_5_idx:\n", + " diff5 = abs(df_saleprice_scaled2['sp_copy_5_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", + " sp_scaled_mean_diff_5.append(diff5)\n", + "\n", + "for i in sp_10_idx:\n", + " diff10 = abs(df_saleprice_scaled2['sp_copy_10_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", + " sp_scaled_mean_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 413, + "id": "de9bf1de-68fe-4894-915a-7069b386123f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_scaled_mean_diff_1))\n", + "print(len(sp_scaled_mean_diff_5))\n", + "print(len(sp_scaled_mean_diff_10))" + ] + }, + { + "cell_type": "markdown", + "id": "f7b93757-d1a7-41a1-85fa-3ee77734be5b", + "metadata": {}, + "source": [ + "#### Calculate mean and var of list of diff. 
- MEAN impute SalePrice" + ] + }, + { + "cell_type": "code", + "execution_count": 414, + "id": "c60d3aad-33f0-48f4-8bb0-f8af45e33e1e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 55971.63676767676 and variance 1% is 1103367192.190047\n", + "The mean of 5% is 58478.24210526314 and variance 5% is 3139731297.2794733\n", + "The mean of 10% is 61028.709911 and variance 10% is 3846674638.263318\n" + ] + } + ], + "source": [ + "m1 = sum(sp_mean_diff_1) / len(sp_mean_diff_1)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res1 = sum((xi - m1) ** 2 for xi in sp_mean_diff_1) / len(sp_mean_diff_1)\n", + "\n", + "m5 = sum(sp_mean_diff_5) / len(sp_mean_diff_5)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res5 = sum((xii - m5) ** 2 for xii in sp_mean_diff_5) / len(sp_mean_diff_5)\n", + "\n", + "\n", + "m10 = sum(sp_mean_diff_10) / len(sp_mean_diff_10)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_mean_diff_10) / len(sp_mean_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 415, + "id": "e7f6e5cf-4eaa-4bfe-add2-fc7f600941b7", + "metadata": {}, + "outputs": [], + "source": [ + "df_mean_saleprice = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", + " '5%_saleprice': [m5, var_res5],\n", + " '10%_saleprice': [m10, var_res10]}, orient='index')\n", + "df_mean_saleprice.columns=['diff. list Mean(MI)', 'diff. list Var.(MI)']" + ] + }, + { + "cell_type": "code", + "execution_count": 416, + "id": "cc37eeaf-e3cd-4a83-870d-fab7037eeffe", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
diff. list Mean(MI)diff. list Var.(MI)
1%_saleprice55971.6367681.103367e+09
5%_saleprice58478.2421053.139731e+09
10%_saleprice61028.7099113.846675e+09
\n", + "
" + ], + "text/plain": [ + " diff. list Mean(MI) diff. list Var.(MI)\n", + "1%_saleprice 55971.636768 1.103367e+09\n", + "5%_saleprice 58478.242105 3.139731e+09\n", + "10%_saleprice 61028.709911 3.846675e+09" + ] + }, + "execution_count": 416, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_mean_saleprice" + ] + }, + { + "cell_type": "markdown", + "id": "f405f073-1b45-47e8-873b-7a9d34ad0e5c", + "metadata": {}, + "source": [ + "#### Calculate mean and var of list of diff. - MEAN impute SalePrice scaled" + ] + }, + { + "cell_type": "code", + "execution_count": 417, + "id": "2516b4f7-6b79-4636-9bd5-0738343ea355", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 0.0 and varience 1% is 0.0\n", + "The mean of 5% is 0.00893610697344667 and varience 5% is 0.0014044730755095036\n", + "The mean of 10% is 0.00893610697344667 and varience 10% is 0.0004431848362889144\n" + ] + } + ], + "source": [ + "m1 = sum(sp_scaled_mean_diff_1) / len(sp_scaled_mean_diff_1)\n", + "\n", + "# calculate variance using a list comprehension\n", + "var_res1 = sum((xi - m1) ** 2 for xi in sp_scaled_mean_diff_1) / len(sp_scaled_mean_diff_1)\n", + "\n", + "m5 = sum(sp_scaled_mean_diff_5) / len(sp_scaled_mean_diff_5)\n", + "\n", + "# calculate variance using a list comprehension\n", + "var_res5 = sum((xii - m5) ** 2 for xii in sp_scaled_mean_diff_5) / len(sp_scaled_mean_diff_5)\n", + "\n", + "\n", + "m10 = sum(sp_scaled_mean_diff_10) / len(sp_scaled_mean_diff_10)\n", + "\n", + "# calculate variance using a list comprehension\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_scaled_mean_diff_10) / len(sp_scaled_mean_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and varience 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and varience 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m5} and varience 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + 
"execution_count": 418, + "id": "fe6a93b8-d6cb-4d7d-856b-ab4ee8fe78fc", + "metadata": {}, + "outputs": [], + "source": [ + "df_mean_saleprice_scaled = pd.DataFrame.from_dict({'1%_saleprice_scaled': [m1, var_res1],\n", + " '5%_saleprice_scaled': [m5, var_res5],\n", + " '10%_saleprice_scaled': [m10, var_res10]}, orient='index')\n", + "df_mean_saleprice_scaled.columns=['diff. list Mean(MI) scaled', 'diff. list Var.(MI) scaled']" + ] + }, + { + "cell_type": "code", + "execution_count": 419, + "id": "e74c35ed-7c2d-44ab-b6c2-4d81c2c6b6bb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
diff. list Mean(MI) scaleddiff. list Var.(MI) scaled
1%_saleprice_scaled0.0000000.000000
5%_saleprice_scaled0.0089360.001404
10%_saleprice_scaled0.0074920.000443
\n", + "
" + ], + "text/plain": [ + " diff. list Mean(MI) scaled diff. list Var.(MI) scaled\n", + "1%_saleprice_scaled 0.000000 0.000000\n", + "5%_saleprice_scaled 0.008936 0.001404\n", + "10%_saleprice_scaled 0.007492 0.000443" + ] + }, + "execution_count": 419, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_mean_saleprice_scaled" + ] + }, + { + "cell_type": "markdown", + "id": "876b979a-f5c4-43a7-9ead-d5d866bef078", + "metadata": {}, + "source": [ + "# 2.2 Housing Data Results - KNN and MEAN" + ] + }, + { + "cell_type": "code", + "execution_count": 420, + "id": "fea4b521-03a3-46ce-b217-27225eb868af", + "metadata": {}, + "outputs": [], + "source": [ + "results1 = pd.concat([df_knn_saleprice, df_knn_saleprice_scaled, df_mean_saleprice, df_mean_saleprice_scaled])" + ] + }, + { + "cell_type": "code", + "execution_count": 421, + "id": "631729d6-e853-4ba5-b5fd-4e632ec00d5f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
diff. list Mean(KNN)diff. list Var.(KNN)diff. list Mean(KNN) scaleddiff. list Var.(KNN) scaleddiff. list Mean(MI)diff. list Var.(MI)diff. list Mean(MI) scaleddiff. list Var.(MI) scaled
1%_saleprice170.0004.240000e+04NaNNaNNaNNaNNaNNaN
5%_saleprice444.9442.554554e+06NaNNaNNaNNaNNaNNaN
10%_saleprice564.7846.304767e+06NaNNaNNaNNaNNaNNaN
1%_salepriceNaNNaN0.0000000.000000e+00NaNNaNNaNNaN
5%_salepriceNaNNaN0.0000262.134350e-08NaNNaNNaNNaN
10%_salepriceNaNNaN0.0000321.417383e-08NaNNaNNaNNaN
1%_salepriceNaNNaNNaNNaN55971.6367681.103367e+09NaNNaN
5%_salepriceNaNNaNNaNNaN58478.2421053.139731e+09NaNNaN
10%_salepriceNaNNaNNaNNaN61028.7099113.846675e+09NaNNaN
1%_saleprice_scaledNaNNaNNaNNaNNaNNaN0.0000000.000000
5%_saleprice_scaledNaNNaNNaNNaNNaNNaN0.0089360.001404
10%_saleprice_scaledNaNNaNNaNNaNNaNNaN0.0074920.000443
\n", + "
" + ], + "text/plain": [ + " diff. list Mean(KNN) diff. list Var.(KNN) \\\n", + "1%_saleprice 170.000 4.240000e+04 \n", + "5%_saleprice 444.944 2.554554e+06 \n", + "10%_saleprice 564.784 6.304767e+06 \n", + "1%_saleprice NaN NaN \n", + "5%_saleprice NaN NaN \n", + "10%_saleprice NaN NaN \n", + "1%_saleprice NaN NaN \n", + "5%_saleprice NaN NaN \n", + "10%_saleprice NaN NaN \n", + "1%_saleprice_scaled NaN NaN \n", + "5%_saleprice_scaled NaN NaN \n", + "10%_saleprice_scaled NaN NaN \n", + "\n", + " diff. list Mean(KNN) scaled \\\n", + "1%_saleprice NaN \n", + "5%_saleprice NaN \n", + "10%_saleprice NaN \n", + "1%_saleprice 0.000000 \n", + "5%_saleprice 0.000026 \n", + "10%_saleprice 0.000032 \n", + "1%_saleprice NaN \n", + "5%_saleprice NaN \n", + "10%_saleprice NaN \n", + "1%_saleprice_scaled NaN \n", + "5%_saleprice_scaled NaN \n", + "10%_saleprice_scaled NaN \n", + "\n", + " diff. list Var.(KNN) scaled diff. list Mean(MI) \\\n", + "1%_saleprice NaN NaN \n", + "5%_saleprice NaN NaN \n", + "10%_saleprice NaN NaN \n", + "1%_saleprice 0.000000e+00 NaN \n", + "5%_saleprice 2.134350e-08 NaN \n", + "10%_saleprice 1.417383e-08 NaN \n", + "1%_saleprice NaN 55971.636768 \n", + "5%_saleprice NaN 58478.242105 \n", + "10%_saleprice NaN 61028.709911 \n", + "1%_saleprice_scaled NaN NaN \n", + "5%_saleprice_scaled NaN NaN \n", + "10%_saleprice_scaled NaN NaN \n", + "\n", + " diff. list Var.(MI) diff. list Mean(MI) scaled \\\n", + "1%_saleprice NaN NaN \n", + "5%_saleprice NaN NaN \n", + "10%_saleprice NaN NaN \n", + "1%_saleprice NaN NaN \n", + "5%_saleprice NaN NaN \n", + "10%_saleprice NaN NaN \n", + "1%_saleprice 1.103367e+09 NaN \n", + "5%_saleprice 3.139731e+09 NaN \n", + "10%_saleprice 3.846675e+09 NaN \n", + "1%_saleprice_scaled NaN 0.000000 \n", + "5%_saleprice_scaled NaN 0.008936 \n", + "10%_saleprice_scaled NaN 0.007492 \n", + "\n", + " diff. 
list Var.(MI) scaled \n", + "1%_saleprice NaN \n", + "5%_saleprice NaN \n", + "10%_saleprice NaN \n", + "1%_saleprice NaN \n", + "5%_saleprice NaN \n", + "10%_saleprice NaN \n", + "1%_saleprice NaN \n", + "5%_saleprice NaN \n", + "10%_saleprice NaN \n", + "1%_saleprice_scaled 0.000000 \n", + "5%_saleprice_scaled 0.001404 \n", + "10%_saleprice_scaled 0.000443 " + ] + }, + "execution_count": 421, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "results1" + ] + }, + { + "cell_type": "code", + "execution_count": 422, + "id": "a255c5bc-c062-4029-8f18-0c7644ca1d7c", + "metadata": {}, + "outputs": [], + "source": [ + "results1.to_csv('housing_data_saleprice_KNN_Mean_results.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c9b0060e-129c-465e-a2a5-c3113ac4b936", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "pytorch_kz_env", + "language": "python", + "name": "pytorch_kz_env" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/Imputation_best_practices/random_numbers_1000.csv b/notebooks/Imputation_best_practices/random_numbers_1000.csv new file mode 100644 index 0000000..b988bad --- /dev/null +++ b/notebooks/Imputation_best_practices/random_numbers_1000.csv @@ -0,0 +1,1001 @@ +,number +0,0.14461602473455892 +1,0.07751503129173953 +2,0.15593297226701996 +3,0.09720879582042008 +4,0.32375017402684214 +5,0.686823745565341 +6,0.7068035159437503 +7,0.9167216890721541 +8,0.6352048775376901 +9,0.17132904054220055 +10,0.8159661332230377 +11,0.16475992352396795 +12,0.0409370627667629 +13,0.16726651783050572 +14,0.9709841404608549 +15,0.7314646963631376 +16,0.3426860074270154 
+17,0.03452867763070577 +18,0.3574832521777054 +19,0.5745017180628896 +20,0.9464018964648249 +21,0.17346442317598176 +22,0.7981877585797893 +23,0.7809787573425518 +24,0.5238193208352585 +25,0.7821735568253659 +26,0.9934482007890996 +27,0.4184423331593896 +28,0.2599014381523176 +29,0.79832254805514 +30,0.6041862665264831 +31,0.3819864440431342 +32,0.8521701748665009 +33,0.3126469510739037 +34,0.573165703657289 +35,0.6265563684951247 +36,0.739416657331853 +37,0.012060677103418738 +38,0.9526287180476393 +39,0.3919187115227588 +40,0.2638910529614693 +41,0.28055121530104343 +42,0.5573435702875359 +43,0.810470016341365 +44,0.5595615325523974 +45,0.408760756112558 +46,0.8630495060594643 +47,0.8614542990838314 +48,0.8236790421079785 +49,0.445107982060686 +50,0.9240480240430241 +51,0.17212099430841699 +52,0.2821871607285322 +53,0.37501938886942654 +54,0.4401439635045862 +55,0.1316322082815632 +56,0.06144638522796442 +57,0.9719025725097523 +58,0.6437628611013991 +59,0.18965508288943556 +60,0.06647339880458658 +61,0.9432875072199843 +62,0.9635593500723799 +63,0.8159138106628153 +64,0.5268141359426226 +65,0.8097577290919002 +66,0.10832871122562193 +67,0.513926863373751 +68,0.5574679474011387 +69,0.23117155673924017 +70,0.7988683863124257 +71,0.14232155967666804 +72,0.4114506075996932 +73,0.028703811806714996 +74,0.15511224785648736 +75,0.5179635133770123 +76,0.6343922699321491 +77,0.5442703351502044 +78,0.2051777299642784 +79,0.9514959457303863 +80,0.8616963431169906 +81,0.9260797192939593 +82,0.6837050092238902 +83,0.6341651538285088 +84,0.47009701258761005 +85,0.6290009641921982 +86,0.9976095248457479 +87,0.6766165875739423 +88,0.34775785853790764 +89,0.24721164403263118 +90,0.7644613432516099 +91,0.8578411267105046 +92,0.02847593788616165 +93,0.7352417508308864 +94,0.6439666934556955 +95,0.4145386388213331 +96,0.9000774058908544 +97,0.20984159212668807 +98,0.5736834527493817 +99,0.5731814122745401 +100,0.39175113064248857 +101,0.9414042225202869 +102,0.35865018640717594 
+103,0.34942147114579614 +104,0.6287322577319368 +105,0.5640558939154473 +106,0.9935619072485498 +107,0.3230972874260011 +108,0.30050448033239197 +109,0.8535359869169682 +110,0.8186071691655027 +111,0.8507126794809163 +112,0.11848293702439716 +113,0.34039997170201786 +114,0.24848934681272938 +115,0.8713564278618446 +116,0.7192981378269337 +117,0.5612771185476495 +118,0.3001718489057721 +119,0.5582566234063182 +120,0.20715922789136187 +121,0.24718349962906172 +122,0.9096809353144786 +123,0.9496126251594162 +124,0.19298962232482253 +125,0.6823143045816399 +126,0.2950869303839806 +127,0.700872866143569 +128,0.9246255564110638 +129,0.3918411220739513 +130,0.5046695500081352 +131,0.40242035593564884 +132,0.5348070625842399 +133,0.6190144238291141 +134,0.6527067332418969 +135,0.7798534811708006 +136,0.8371435153002993 +137,0.7256654504898371 +138,0.19486710733751433 +139,0.17061227388763445 +140,0.3866266766943538 +141,0.9861342050546121 +142,0.12499832976125236 +143,0.4076100289319884 +144,0.24405060656519562 +145,0.24658924623282708 +146,0.31303910086742404 +147,0.13582549628998997 +148,0.4267352707490074 +149,0.6860815270131422 +150,0.2632104445655937 +151,0.7095899448677616 +152,0.30697391312148903 +153,0.15020764760355143 +154,0.33008237434926957 +155,0.24730791798017127 +156,0.7732146302465086 +157,0.3986960975344779 +158,0.878302550945857 +159,0.3073561016445441 +160,0.21123045619113257 +161,0.5806664509148879 +162,0.8984369263318096 +163,0.8363942698985983 +164,0.2812623945509036 +165,0.10724622968453401 +166,0.5703943012638906 +167,0.7309007201275504 +168,0.6865969394598082 +169,0.17355862259884247 +170,0.41747139600619776 +171,0.8046329781439144 +172,0.29734663284924356 +173,0.6874907011989809 +174,0.27926268019676004 +175,0.16857167772740067 +176,0.808320103826969 +177,0.22397888146185907 +178,0.4961137292567884 +179,0.39791460648438426 +180,0.749624236829485 +181,0.8166672255804612 +182,0.5416591595071085 +183,0.7784968348980786 +184,0.5246274130247313 
+185,0.6165788811775392 +186,0.18993747860389354 +187,0.4375903866391334 +188,0.8977799452863308 +189,0.8974808404906014 +190,0.7833353163003136 +191,0.5735505446147654 +192,0.8592478266591742 +193,0.555628191461239 +194,0.29218190018690193 +195,0.6823254024415241 +196,0.7253556992028032 +197,0.6348979373592366 +198,0.738955355288769 +199,0.40548956817360793 +200,0.9965074549246696 +201,0.6680475408833246 +202,0.4753087000915296 +203,0.8154531729554498 +204,0.39674071637462927 +205,0.3465424212251109 +206,0.3010873336265142 +207,0.3453059844140016 +208,0.3376649450698975 +209,0.4520568726021712 +210,0.7102711170123417 +211,0.5676304992868505 +212,0.246451823758292 +213,0.3045971494873321 +214,0.9191799326806603 +215,0.09062317707388845 +216,0.6456030768852257 +217,0.8145182625891805 +218,0.3502989381872097 +219,0.5454669053640021 +220,0.9229510982790893 +221,0.5017605011244138 +222,0.5814298938642755 +223,0.212077064497179 +224,0.9084673048697015 +225,0.8420689009087419 +226,0.09544595716628035 +227,0.5428219386076877 +228,0.334040059452826 +229,0.5883742904617911 +230,0.6681527250828868 +231,0.920066967991107 +232,0.6980014815164323 +233,0.5140583511099508 +234,0.574062901794968 +235,0.8671650796521554 +236,0.29309281744572635 +237,0.6255644089859125 +238,0.41377688075614283 +239,0.6541722779053092 +240,0.7022455597573617 +241,0.7027961835253476 +242,0.32866027307469425 +243,0.9438823677034145 +244,0.6392304917718383 +245,0.35610068008813955 +246,0.5109988272940061 +247,0.7549785046509206 +248,0.911498498846909 +249,0.7269132750864981 +250,0.43346849143235944 +251,0.9613052659398792 +252,0.06410207161162618 +253,0.7224542800953787 +254,0.8605028822342475 +255,0.9379303538857604 +256,0.11890111097053702 +257,0.06560232272410749 +258,0.9815175258058294 +259,0.5816233574934034 +260,0.3223771211316614 +261,0.010794999021216611 +262,0.48232848210912416 +263,0.6888091652734284 +264,0.7510123953710294 +265,0.3931342633771988 +266,0.4285185942589612 
+267,0.028804295777431044 +268,0.7471054611787746 +269,0.5188475627728396 +270,0.3699806335289325 +271,0.6733240981418717 +272,0.455659972278607 +273,0.8865920570538507 +274,0.9773310825483524 +275,0.9114683092627319 +276,0.7234740743957591 +277,0.47378640650570536 +278,0.9044322182580692 +279,0.6490971485609244 +280,0.9325706015784121 +281,0.15806103989245135 +282,0.20431604755502109 +283,0.9516960107212825 +284,0.17933034496530176 +285,0.10632943259433447 +286,0.20529052976827733 +287,0.26644977396966907 +288,0.990842732357776 +289,0.6626056375310618 +290,0.8934023242009224 +291,0.6087787761836707 +292,0.6622123753279109 +293,0.2795715500728444 +294,0.7356211918792761 +295,0.023450952083761578 +296,0.29930766895885463 +297,0.9605253146799532 +298,0.4773205356946918 +299,0.896685482640458 +300,0.20788119046629716 +301,0.21907107928738412 +302,0.3417751133430835 +303,0.8785812995819484 +304,0.7629857606713326 +305,0.10409839946928867 +306,0.5375122454578438 +307,0.12610808266796247 +308,0.9207106566062669 +309,0.6614470367535862 +310,0.6646296886200127 +311,0.02517423927343887 +312,0.5355435671395777 +313,0.9639505712726043 +314,0.8427700240424094 +315,0.5173256280251634 +316,0.6809361625916177 +317,0.25269387981635383 +318,7.39014254360626e-05 +319,0.6832379417409375 +320,0.3814705574477538 +321,0.2953366513034189 +322,0.8601629667491553 +323,0.4116625534183441 +324,0.20248827761656263 +325,0.0950677170887495 +326,0.37432668808858527 +327,0.5002586204770462 +328,0.5903766299860601 +329,0.4069147751233232 +330,0.46587616114566655 +331,0.20767274566478722 +332,0.4405095567714371 +333,0.7561490702983013 +334,0.9691510044256642 +335,0.9835349892112961 +336,0.08167974686852508 +337,0.011831197129136273 +338,0.2533369151703784 +339,0.7258386397040382 +340,0.1533224004672512 +341,0.16976063838308353 +342,0.3535761067133554 +343,0.9558080514913609 +344,0.34787269425215606 +345,0.6384858181781367 +346,0.19142808499268715 +347,0.3723886499126876 +348,0.4610104267479409 
+349,0.7386414627232165 +350,0.5547224736511918 +351,0.07560627992824742 +352,0.38543929036328295 +353,0.023870001618478964 +354,0.08490118558975879 +355,0.9523181200843006 +356,0.835121255953561 +357,0.8313253101018512 +358,0.4477164423027221 +359,0.427173224834863 +360,0.2607502696316568 +361,0.6518880149684392 +362,0.989596091701078 +363,0.4737188317675711 +364,0.951663574431818 +365,0.6389835611029937 +366,0.4255250760028354 +367,0.36494823219306194 +368,0.10394871793754767 +369,0.08787887115953141 +370,0.05185866702404662 +371,0.5729228447658512 +372,0.3557153056497062 +373,0.14169200930635462 +374,0.6026259214704931 +375,0.6780938325392907 +376,0.0019220493053816456 +377,0.14423401505903843 +378,0.31021740847078616 +379,0.26542859991807166 +380,0.05293698137098246 +381,0.5447383348415423 +382,0.19410883367100906 +383,0.2759766462115508 +384,0.6085305795585376 +385,0.19018564330800136 +386,0.6001023952936514 +387,0.5500869240450543 +388,0.308558554189692 +389,0.613015054522192 +390,0.5053671279653127 +391,0.8033565610860482 +392,0.3190316438196028 +393,0.8430688477494918 +394,0.3907441626865247 +395,0.3749010705929905 +396,0.20374147066354986 +397,0.4445572005828903 +398,0.4325615226381033 +399,0.747347832034453 +400,0.1408237945119577 +401,0.5629196065967164 +402,0.8883715667513505 +403,0.7262344816634011 +404,0.1015240156369166 +405,0.6274596622730756 +406,0.6724938834493908 +407,0.45890555605876826 +408,0.253862163313197 +409,0.20213399227024142 +410,0.9431472444002996 +411,0.4412716272261822 +412,0.6778537756613036 +413,0.5609208700560778 +414,0.7852790417028147 +415,0.8301487622409094 +416,0.0695242591856422 +417,0.5342345164968271 +418,0.020198821857018268 +419,0.11932836566667071 +420,0.7351542137502673 +421,0.879354084852934 +422,0.060390921051916124 +423,0.3517659280158124 +424,0.25831407832342757 +425,0.25041309629182773 +426,0.6324032934179679 +427,0.6905116746744266 +428,0.038781141504878325 +429,0.11872222658971077 +430,0.3402172182577837 
+431,0.1117834948318035 +432,0.8974663997148172 +433,0.7721061886641211 +434,0.467763325594456 +435,0.45960484726135 +436,0.11940893902740168 +437,0.8892320824757846 +438,0.056170722740824464 +439,0.8348974660229447 +440,0.8328276290445746 +441,0.015421942378315512 +442,0.6078039146470725 +443,0.9797170916017848 +444,0.817871594488278 +445,0.4281570072853328 +446,0.9826586617461194 +447,0.5714323337805088 +448,0.5655480118995616 +449,0.13163751508874266 +450,0.5727166298844355 +451,0.3876989055629705 +452,0.24625748760449773 +453,0.062376725489559304 +454,0.1868295868142189 +455,0.07519337399332371 +456,0.8615125038568271 +457,0.0430765434686432 +458,0.7784279481001283 +459,0.1559200654309939 +460,0.28457480300272475 +461,0.4833371043049315 +462,0.21688560355701902 +463,0.051055375260327884 +464,0.8764119752087609 +465,0.03830180552041673 +466,0.899276170682331 +467,0.5326669068942715 +468,0.7966592760107886 +469,0.5977938689767619 +470,0.35735055753216216 +471,0.7502306585594846 +472,0.27262195939610845 +473,0.3367003915054816 +474,0.3718378858875636 +475,0.7252726856566986 +476,0.6108078470654391 +477,0.160140124957443 +478,0.640641195165919 +479,0.819043970313203 +480,0.9460930077740923 +481,0.3955113176387407 +482,0.08228064172201954 +483,0.5692148152461914 +484,0.9379027430417781 +485,0.7262721958954546 +486,0.9974714724600596 +487,0.9816411645054782 +488,0.02801478549452141 +489,0.35876394018958924 +490,0.46224300725504386 +491,0.07977812492324099 +492,0.7825821331768681 +493,0.7728747320072956 +494,0.18411522733742114 +495,0.9349933626453013 +496,0.3305156463539396 +497,0.05247324921620988 +498,0.3784435570491954 +499,0.8296025413407634 +500,0.44108727645927825 +501,0.2993358032378495 +502,0.8631126359025391 +503,0.250262827945147 +504,0.09566738091105942 +505,0.7130474946994906 +506,0.2235781443128807 +507,0.7026149405611689 +508,0.7224945548679957 +509,0.6170012611217315 +510,0.20186432914831431 +511,0.7852714452298651 +512,0.8903242744728199 
+513,0.1399056906045737 +514,0.17026945833848617 +515,0.514586763470415 +516,0.9736100357614889 +517,0.7746591507784915 +518,0.29437001890274195 +519,0.8027253084378705 +520,0.08386991518130038 +521,0.09136100092018629 +522,0.8983567502463687 +523,0.8868693311046169 +524,0.533466309836137 +525,0.42900189716927073 +526,0.1821870276409372 +527,0.4315150943786541 +528,0.47383956070476785 +529,0.42647315825719867 +530,0.20889106515275513 +531,0.15615589390655582 +532,0.7683598815481214 +533,0.8407774935346721 +534,0.4599058924434972 +535,0.20858605861422153 +536,0.25419023941340724 +537,0.03537597137641857 +538,0.5037011171417803 +539,0.319855948227728 +540,0.6143932185624659 +541,0.11338109816795006 +542,0.6071773224023549 +543,0.6320103598568474 +544,0.17739418618305125 +545,0.9193076779462215 +546,0.539317629461803 +547,0.361121293498606 +548,0.8225521587592494 +549,0.037067189096233966 +550,0.7644376889628157 +551,0.9614375433647248 +552,0.26247829558958613 +553,0.04497704041286332 +554,0.49347237237561237 +555,0.10135820428850206 +556,0.9054759324635467 +557,0.3912479745377101 +558,0.16984308812935767 +559,0.3130327921420567 +560,0.2845393861009978 +561,0.7216547111114262 +562,0.6129838442158642 +563,0.6128072542663652 +564,0.5153838338789999 +565,0.7131085367862817 +566,0.8713477772442941 +567,0.9419360672901563 +568,0.9061770339937525 +569,0.9973713503589123 +570,0.6511737928834931 +571,0.0980714039543844 +572,0.12371358453480508 +573,0.5817580949438432 +574,0.3878197750090975 +575,0.3836838844640248 +576,0.3330772932400339 +577,0.8937920239990277 +578,0.42660379831271933 +579,0.09749777821209016 +580,0.03273234283716975 +581,0.5822939987582022 +582,0.2818759219290342 +583,0.9973773382690185 +584,0.3485811650096795 +585,0.38385951065171464 +586,0.14314846321555819 +587,0.41168484188278187 +588,0.5560325831949468 +589,0.6786651527115524 +590,0.27941662328630534 +591,0.12758615070559087 +592,0.8706880276786881 +593,0.42247163006009736 +594,0.8747921784321767 
+595,0.9819789489386005 +596,0.53212913612486 +597,0.6820548577830702 +598,0.14172556124342628 +599,0.8954903213991394 +600,0.8877895505948118 +601,0.2899734461911796 +602,0.39888758518426926 +603,0.5085270928974726 +604,0.5397323464650328 +605,0.5355595876880633 +606,0.6680045600991499 +607,0.07890855054344348 +608,0.36522753036507116 +609,0.7525828516063231 +610,0.8155334605307646 +611,0.948872329161571 +612,0.10085424156574552 +613,0.3063104444859259 +614,0.012248867459916157 +615,0.8332405266792986 +616,0.4477328006875678 +617,0.7381760858313725 +618,0.5381307278002123 +619,0.64442652761133 +620,0.407653279216153 +621,0.988120343671508 +622,0.349242158981631 +623,0.11439639275168989 +624,0.773600974105568 +625,0.3422508667504136 +626,0.35092901992304426 +627,0.6998555631853256 +628,0.5351463864628954 +629,0.6941915466139217 +630,0.27550090759498 +631,0.03955870654832727 +632,0.9737612333749457 +633,0.85659566451438 +634,0.318016024519294 +635,0.07264967870375483 +636,0.6266672136646679 +637,0.5427530067840908 +638,0.08013357115177333 +639,0.27865447324993387 +640,0.8204327600278204 +641,0.6472338718548233 +642,0.8981066937808309 +643,0.9904134149156683 +644,0.7570648348954108 +645,0.04820939759809295 +646,0.49659488586991385 +647,0.2681871451946377 +648,0.05376519761698151 +649,0.1536101940376925 +650,0.2458849441738461 +651,0.19991898782481343 +652,0.49815295225863154 +653,0.7475145062482099 +654,0.5814474904248211 +655,0.9103815228294841 +656,0.8091439841662771 +657,0.044556478634595 +658,0.06582839484468272 +659,0.8723124347377673 +660,0.761407419742959 +661,0.6295611439582762 +662,0.5602756647971817 +663,0.028833108636930782 +664,0.6925154173449602 +665,0.30781547100300766 +666,0.9456746547718861 +667,0.7733519530494579 +668,0.07325928323474962 +669,0.06051359621130603 +670,0.7684091239449635 +671,0.0772898478864189 +672,0.4652145959688888 +673,0.4373876627767307 +674,0.6267684478070814 +675,0.7183418633741062 +676,0.28256468766217413 
+677,0.5073826011665699 +678,0.31820311938601464 +679,0.4089168748142118 +680,0.29885921770184043 +681,0.03372851278925548 +682,0.6703170306185748 +683,0.33198869826189814 +684,0.5975405123566822 +685,0.8211657963714585 +686,0.3461079054656666 +687,0.48616250243415104 +688,0.13447950866733605 +689,0.562667191415577 +690,0.7678216928305848 +691,0.4530052286033409 +692,0.5010228200975811 +693,0.4323309760765164 +694,0.36743023729184987 +695,0.1723991626473217 +696,0.4337302869241262 +697,0.24966845326719822 +698,0.642167289966723 +699,0.616830008851879 +700,0.7703637450499222 +701,0.21386173939654995 +702,0.704115745850898 +703,0.6905967742396926 +704,0.14550064889741277 +705,0.6045853103312959 +706,0.03670533871021342 +707,0.7158949195594291 +708,0.5963326610400751 +709,0.7656919572130952 +710,0.16593604258736716 +711,0.37116447793513807 +712,0.8005826062394383 +713,0.041771054650389106 +714,0.6847846478124059 +715,0.4993883882765534 +716,0.1850707225574446 +717,0.5630874044249621 +718,0.37025234599378876 +719,0.7107125656980158 +720,0.4118677519270143 +721,0.7742568360649871 +722,0.8100159822588088 +723,0.3174629757017041 +724,0.5303493054894146 +725,0.8849961235045513 +726,0.3273403729546115 +727,0.6172150375830504 +728,0.15983060531231819 +729,0.4728594510763161 +730,0.4529506215548965 +731,0.5035430872599636 +732,0.004927231548344402 +733,0.1940383807540148 +734,0.14982458424309364 +735,0.8563549025851751 +736,0.03884058951015723 +737,0.28522238435867453 +738,0.8057900651211597 +739,0.03021709036511122 +740,0.07224489509195386 +741,0.056610587902518716 +742,0.9264467821014194 +743,0.8138662549320123 +744,0.41783822642927937 +745,0.8723047253359363 +746,0.18136207963463802 +747,0.7164025688996778 +748,0.8196872616954788 +749,0.8068822585021751 +750,0.007129291396152926 +751,0.2602504030386925 +752,0.46370562857123043 +753,0.163784347412389 +754,0.23315134483036648 +755,0.6177440123966893 +756,0.2561521510607473 +757,0.562548076892661 +758,0.5051861935336659 
+759,0.13892890236963107 +760,0.004539613445676105 +761,0.17372524036846493 +762,0.6832015932759417 +763,0.8325857535808265 +764,6.826981312790803e-05 +765,0.19612584863473537 +766,0.4145509719106246 +767,0.2619625834737831 +768,0.24549665294458467 +769,0.27612714237335956 +770,0.8531795517703349 +771,0.047146001044882424 +772,0.562788499298586 +773,0.43099863376962144 +774,0.26050958743406505 +775,0.7788002061420074 +776,0.6743332176478016 +777,0.40066992822420555 +778,0.9760876856806906 +779,0.539119034171984 +780,0.18208901259127885 +781,0.12376735142175199 +782,0.9551514655114575 +783,0.7810294736400567 +784,0.9212583468427701 +785,0.8010043139785669 +786,0.22944051406680832 +787,0.050052241727377766 +788,0.6786745563768194 +789,0.429793629888368 +790,0.42563361699182967 +791,0.6784838537337905 +792,0.2858761720399675 +793,0.2890895011305119 +794,0.025121632825633844 +795,0.25765509253553054 +796,0.43572322499776717 +797,0.6647102169428171 +798,0.10847616026636064 +799,0.2537450603718995 +800,0.24416864473064126 +801,0.0672514263787497 +802,0.16935229953659314 +803,0.27439580112524253 +804,0.4284736191801598 +805,0.8586734606964571 +806,0.4315781202007021 +807,0.09915635234890208 +808,0.44899905032025744 +809,0.013316716483281699 +810,0.8391449274551819 +811,0.5061770521104294 +812,0.0672045714638001 +813,0.2933544809181752 +814,0.18022127393582965 +815,0.8781136361676581 +816,0.5157135259800142 +817,0.46243072336418334 +818,0.6222491687600095 +819,0.8889053056935484 +820,0.04571095891205823 +821,0.1513640763692672 +822,0.7774449453314359 +823,0.5183880690457242 +824,0.2921720252636122 +825,0.09168278609192515 +826,0.39002371887786735 +827,0.3580585061283823 +828,0.12047021435718164 +829,0.6738337221623005 +830,0.21958552211366156 +831,0.5648142473736366 +832,0.23497653874753555 +833,0.16544595712611387 +834,0.040561694693181605 +835,0.7355715205459343 +836,0.9004365787736869 +837,0.5459151013055901 +838,0.7480058346265005 +839,0.7141260383574005 
+840,0.1158157631511092 +841,0.9125379342891712 +842,0.3680018768100638 +843,0.7402206231811581 +844,0.2972738079840226 +845,0.8923504613507662 +846,0.5063568640229354 +847,0.24619949696371157 +848,0.5399981903000146 +849,0.7188539530946122 +850,0.648195890336554 +851,0.724518894463568 +852,0.14288147919479144 +853,0.7994514226699949 +854,0.6226355760247099 +855,0.010176035425188967 +856,0.4131692686695717 +857,0.834692399566853 +858,0.49912957372925004 +859,0.00438814293685974 +860,0.3252041908817417 +861,0.534840233118543 +862,0.3587118743837924 +863,0.9677560902733098 +864,0.5973183201684436 +865,0.296691425381007 +866,0.5855079326424412 +867,0.20240300955532187 +868,0.6021550529096645 +869,0.8824421051967469 +870,0.3072946199859422 +871,0.3128979438155097 +872,0.5475105438225643 +873,0.4842448962628426 +874,0.15025538438496855 +875,0.310622456701922 +876,0.6023436011138587 +877,0.5754165898365287 +878,0.6577607923072721 +879,0.7857515237431592 +880,0.22057576301022253 +881,0.8661095076438114 +882,0.910244039608377 +883,0.578456971142587 +884,0.3787935162597653 +885,0.08939098828841929 +886,0.9232626564888574 +887,0.1712490756353049 +888,0.779216672902944 +889,0.3495372334946847 +890,0.47001887737996617 +891,0.29750226759355936 +892,0.2810128485470573 +893,0.2437794575755069 +894,0.2624381305719474 +895,0.8246608579175856 +896,0.6942956761673141 +897,0.11515579868519688 +898,0.1206162339748359 +899,0.26196220525263014 +900,0.5553026135773536 +901,0.40720637901420265 +902,0.9638145298530792 +903,0.4117628415691498 +904,0.31618951259604455 +905,0.11765701103218917 +906,0.33470652854411564 +907,0.7366235956449027 +908,0.7581529716898141 +909,0.9554767313213507 +910,0.8837680591214232 +911,0.12426303151941864 +912,0.13192594906673982 +913,0.13159583337236658 +914,0.8413301780622977 +915,0.5495370639785346 +916,0.8125566245605387 +917,0.764454058143039 +918,0.9022709587116715 +919,0.22879685531861071 +920,0.49057430203325403 +921,0.4724960647844604 
+922,0.8055598260756343 +923,0.7603094118394911 +924,0.3728373302689516 +925,0.3568389711535207 +926,0.4241494594670866 +927,0.7538918294606227 +928,0.5278021541536974 +929,0.4605573424438759 +930,0.6738635250250887 +931,0.16054005910324365 +932,0.8428762894592794 +933,0.9518468101445031 +934,0.32776599980321264 +935,0.3459454626103713 +936,0.08290510118997685 +937,0.4134429089919419 +938,0.7577633137424186 +939,0.4360752405153524 +940,0.977898855124461 +941,0.3899549115493246 +942,0.07360874043480192 +943,0.6234394805204561 +944,0.8281399000229284 +945,0.5936401403938281 +946,0.9444301233719021 +947,0.18311569423561358 +948,0.19900897833219744 +949,0.5859537329420677 +950,0.45369641243149117 +951,0.8140494291811821 +952,0.15504116789135103 +953,0.5097058344234562 +954,0.46015129255339193 +955,0.9168374769143446 +956,0.6646855362668478 +957,0.08710995188842596 +958,0.9648211892689712 +959,0.3099412950871465 +960,0.4182764603873177 +961,0.2811470272374724 +962,0.36150098707209977 +963,0.7547921114548144 +964,0.038441021458981206 +965,0.6114605284345398 +966,0.20333754648264146 +967,0.6879693726518868 +968,0.5615887399000671 +969,0.10931708773465398 +970,0.8275712918793767 +971,0.7747109160797243 +972,0.9005913428689535 +973,0.6399242580079716 +974,0.717434307883715 +975,0.0782758727785875 +976,0.05968847507483932 +977,0.9824576958211914 +978,0.02495988725135534 +979,0.2620968894854523 +980,0.010107863826380292 +981,0.2764875736254404 +982,0.18403412415931986 +983,0.1616789092290818 +984,0.3454521050417132 +985,0.433499552863608 +986,0.040911884966301715 +987,0.20484238883308725 +988,0.6675520566953549 +989,0.6160709258598361 +990,0.04474552091720452 +991,0.40241951588041347 +992,0.5873473825076658 +993,0.38212818142632543 +994,0.8770948644179681 +995,0.18210726703943658 +996,0.7879879363150989 +997,0.14870738186047538 +998,0.15312132054135852 +999,0.4747372545447177 From dd3afdbce60d4c1b5ef3a98a5f0943695464b761 Mon Sep 17 00:00:00 2001 From: Shesh Narayan Gupta 
<91396937+SheshNGupta@users.noreply.github.com> Date: Tue, 7 Jun 2022 21:29:40 -0400 Subject: [PATCH 6/8] Updated notebook --- .../Imputation_best_practices.ipynb | 4557 ----------------- 1 file changed, 4557 deletions(-) delete mode 100644 notebooks/Imputation_best_practices/Imputation_best_practices.ipynb diff --git a/notebooks/Imputation_best_practices/Imputation_best_practices.ipynb b/notebooks/Imputation_best_practices/Imputation_best_practices.ipynb deleted file mode 100644 index 87d582d..0000000 --- a/notebooks/Imputation_best_practices/Imputation_best_practices.ipynb +++ /dev/null @@ -1,4557 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "e2ceaeb0-e282-4c63-97e2-f1dd03810aa2", - "metadata": {}, - "source": [ - "# What to try in this notebook?\n", - "\n", - "#### 1. Get a random number generated dataset from kaggle, use one column and create missing (1%, 5%, 10%), scale values, apply KNN, MEAN imputation. Compare the results and compute mean() and var() for the list of differences between org. and Imputed value \n", - "\n", - "Dataset - https://www.kaggle.com/timoboz/random-numbers\n", - "\n", - "#### 2. Use a housing dataset from UCI, use one column and create missing (1%, 5%, 10%), scale values, apply KNN, MEAN imputation. Compare the results and compute mean() and var() for the list of differences between org. 
and Imputed value \n", - "\n", - "Dataset - https://github.com/nikbearbrown/AI_Research_Group/blob/main/Awesome-UCI-Datasets/Classification/House_Price_predication/train.csv" - ] - }, - { - "cell_type": "code", - "execution_count": 101, - "id": "d8fe4103-6e71-4b97-810c-b599a0482944", - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import numpy as np\n", - "import warnings\n", - "warnings.filterwarnings('ignore')\n", - "from sklearn.impute import KNNImputer\n", - "from sklearn.preprocessing import MinMaxScaler" - ] - }, - { - "cell_type": "markdown", - "id": "f95427ef-d6bc-47b8-a516-45a05b238180", - "metadata": {}, - "source": [ - "# 1.1 Random Numbers dataset" - ] - }, - { - "cell_type": "code", - "execution_count": 102, - "id": "03fc0415-cdd2-415b-a273-08037b06afcf", - "metadata": {}, - "outputs": [], - "source": [ - "random_dataset = pd.read_csv('random_numbers_1000.csv')" - ] - }, - { - "cell_type": "code", - "execution_count": 103, - "id": "5ea97930-03cd-48ff-97b9-97e9cd9dde55", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
     Unnamed: 0    number
782         782  0.955151
378         378  0.310217
542         542  0.607177
80           80  0.861696
282         282  0.204316
976         976  0.059688
924         924  0.372837
329         329  0.406915
131         131  0.402420
607         607  0.078909
\n", - "
" - ], - "text/plain": [ - " Unnamed: 0 number\n", - "782 782 0.955151\n", - "378 378 0.310217\n", - "542 542 0.607177\n", - "80 80 0.861696\n", - "282 282 0.204316\n", - "976 976 0.059688\n", - "924 924 0.372837\n", - "329 329 0.406915\n", - "131 131 0.402420\n", - "607 607 0.078909" - ] - }, - "execution_count": 103, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "random_dataset.sample(10)" - ] - }, - { - "cell_type": "code", - "execution_count": 104, - "id": "f19e199b-91aa-4e03-9e07-37f5a574d481", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 1000 entries, 0 to 999\n", - "Data columns (total 2 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 Unnamed: 0 1000 non-null int64 \n", - " 1 number 1000 non-null float64\n", - "dtypes: float64(1), int64(1)\n", - "memory usage: 15.8 KB\n" - ] - } - ], - "source": [ - "random_dataset.info()" - ] - }, - { - "cell_type": "code", - "execution_count": 105, - "id": "382f0f03-b3f4-4244-a95c-e78476fae2ca", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "count 1000.000000\n", - "mean 0.490463\n", - "std 0.284669\n", - "min 0.000068\n", - "25% 0.252124\n", - "50% 0.479825\n", - "75% 0.735584\n", - "max 0.997610\n", - "Name: number, dtype: float64" - ] - }, - "execution_count": 105, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "random_dataset['number'].describe()" - ] - }, - { - "cell_type": "markdown", - "id": "348a0b85-c450-4d5d-a9d2-c57c95964b42", - "metadata": {}, - "source": [ - "#### Create 3 col. for numbers for 1%, 5% and 10% missing data" - ] - }, - { - "cell_type": "code", - "execution_count": 106, - "id": "f5de26b3-17b7-463b-98e4-147a457ca37e", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
       number  number_copy_1_percent  number_copy_5_percent  number_copy_10_percent
0    0.144616               0.144616               0.144616                0.144616
1    0.077515               0.077515               0.077515                0.077515
2    0.155933               0.155933               0.155933                0.155933
3    0.097209               0.097209               0.097209                0.097209
4    0.323750               0.323750               0.323750                0.323750
..        ...                    ...                    ...                     ...
995  0.182107               0.182107               0.182107                0.182107
996  0.787988               0.787988               0.787988                0.787988
997  0.148707               0.148707               0.148707                0.148707
998  0.153121               0.153121               0.153121                0.153121
999  0.474737               0.474737               0.474737                0.474737
\n", - "

1000 rows × 4 columns

\n", - "
" - ], - "text/plain": [ - " number number_copy_1_percent number_copy_5_percent \\\n", - "0 0.144616 0.144616 0.144616 \n", - "1 0.077515 0.077515 0.077515 \n", - "2 0.155933 0.155933 0.155933 \n", - "3 0.097209 0.097209 0.097209 \n", - "4 0.323750 0.323750 0.323750 \n", - ".. ... ... ... \n", - "995 0.182107 0.182107 0.182107 \n", - "996 0.787988 0.787988 0.787988 \n", - "997 0.148707 0.148707 0.148707 \n", - "998 0.153121 0.153121 0.153121 \n", - "999 0.474737 0.474737 0.474737 \n", - "\n", - " number_copy_10_percent \n", - "0 0.144616 \n", - "1 0.077515 \n", - "2 0.155933 \n", - "3 0.097209 \n", - "4 0.323750 \n", - ".. ... \n", - "995 0.182107 \n", - "996 0.787988 \n", - "997 0.148707 \n", - "998 0.153121 \n", - "999 0.474737 \n", - "\n", - "[1000 rows x 4 columns]" - ] - }, - "execution_count": 106, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_number = random_dataset[['number']]\n", - "df_number['number_copy_1_percent'] = df_number[['number']]\n", - "df_number['number_copy_5_percent'] = df_number[['number']]\n", - "df_number['number_copy_10_percent'] = df_number[['number']]\n", - "df_number" - ] - }, - { - "cell_type": "markdown", - "id": "1ff95002-46a0-454b-97c1-6c189153d459", - "metadata": {}, - "source": [ - "#### Check % missing values in this dataframe" - ] - }, - { - "cell_type": "code", - "execution_count": 107, - "id": "35c38775-26d9-4b1e-97a9-4c46c0d5d92b", - "metadata": {}, - "outputs": [], - "source": [ - "def get_percent_missing(dataframe):\n", - " \n", - " percent_missing = dataframe.isnull().sum() * 100 / len(dataframe)\n", - " missing_value_df = pd.DataFrame({'column_name': dataframe.columns,\n", - " 'percent_missing': percent_missing})\n", - " return missing_value_df" - ] - }, - { - "cell_type": "code", - "execution_count": 108, - "id": "6837b7e5-4444-4914-9c0e-a9cefd2c7b6f", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - 
"number number 0.0\n", - "number_copy_1_percent number_copy_1_percent 0.0\n", - "number_copy_5_percent number_copy_5_percent 0.0\n", - "number_copy_10_percent number_copy_10_percent 0.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_number))" - ] - }, - { - "cell_type": "markdown", - "id": "25318ebf-b1bf-4f4b-ba1d-011b27a27f39", - "metadata": {}, - "source": [ - "#### Create missing helper fn" - ] - }, - { - "cell_type": "code", - "execution_count": 109, - "id": "76da9076-d9c8-417e-bcfc-8ce7066d1a53", - "metadata": {}, - "outputs": [], - "source": [ - "def create_missing(dataframe, percent, col):\n", - " dataframe.loc[dataframe.sample(frac = percent).index, col] = np.nan" - ] - }, - { - "cell_type": "markdown", - "id": "9dc43e57-be39-4efe-8131-d6a3423b8d77", - "metadata": {}, - "source": [ - "#### Create missing data in each col" - ] - }, - { - "cell_type": "code", - "execution_count": 110, - "id": "6e8ab693-6043-4ade-b62a-9b3fc9ebf735", - "metadata": {}, - "outputs": [], - "source": [ - "create_missing(df_number, 0.01, 'number_copy_1_percent')\n", - "create_missing(df_number, 0.05, 'number_copy_5_percent')\n", - "create_missing(df_number, 0.1, 'number_copy_10_percent')" - ] - }, - { - "cell_type": "markdown", - "id": "655cb92a-6b63-4498-9c31-d63f11145569", - "metadata": {}, - "source": [ - "#### Check % missing after removing data" - ] - }, - { - "cell_type": "code", - "execution_count": 111, - "id": "412518b5-67ec-4a5a-9720-4a0ce7657d44", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "number number 0.0\n", - "number_copy_1_percent number_copy_1_percent 1.0\n", - "number_copy_5_percent number_copy_5_percent 5.0\n", - "number_copy_10_percent number_copy_10_percent 10.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_number))" - ] - }, - { - "cell_type": "markdown", - "id": "6876e3fc-b878-4560-a3a4-72c36f2a422e", - "metadata": {}, - "source": [ - 
"#### Store the indices of missing rows" - ] - }, - { - "cell_type": "code", - "execution_count": 112, - "id": "c1860270-add6-4963-9aef-27ef1e171fca", - "metadata": {}, - "outputs": [], - "source": [ - "# Store Index of NaN values in each coloumns\n", - "number_1_idx = list(np.where(df_number['number_copy_1_percent'].isna())[0])\n", - "number_5_idx = list(np.where(df_number['number_copy_5_percent'].isna())[0])\n", - "number_10_idx = list(np.where(df_number['number_copy_10_percent'].isna())[0])" - ] - }, - { - "cell_type": "code", - "execution_count": 113, - "id": "57841da6-b453-40cc-8ecc-702fe4613a74", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Length of number_1_idx is 10 and it contains 1.0% of total data in column | Total rows: 1000\n", - "Length of number_5_idx is 50 and it contains 5.0% of total data in column | Total rows: 1000\n", - "Length of number_10_idx is 100 and it contains 10.0% of total data in column | Total rows: 1000\n" - ] - } - ], - "source": [ - "print(f\"Length of number_1_idx is {len(number_1_idx)} and it contains {(len(number_1_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")\n", - "print(f\"Length of number_5_idx is {len(number_5_idx)} and it contains {(len(number_5_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")\n", - "print(f\"Length of number_10_idx is {len(number_10_idx)} and it contains {(len(number_10_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")" - ] - }, - { - "cell_type": "markdown", - "id": "47469d0b-a8f3-4469-b18c-3a457f7dc373", - "metadata": {}, - "source": [ - "### Perform KNN impute to df_number dataframe" - ] - }, - { - "cell_type": "code", - "execution_count": 114, - "id": "b09c6c85-4ce3-4aeb-bb81-6a698494a58e", 
- "metadata": {}, - "outputs": [], - "source": [ - "df_number1 = df_number.copy(deep=True)\n", - "imputer = KNNImputer(n_neighbors=5)\n", - "imputed_number_df = pd.DataFrame(imputer.fit_transform(df_number1), columns = df_number1.columns)\n" - ] - }, - { - "cell_type": "code", - "execution_count": 115, - "id": "2f051a7d-3ebd-4839-aae0-ef125944d613", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
       number  number_copy_1_percent  number_copy_5_percent  number_copy_10_percent
347  0.372389               0.372389               0.372389                0.372389
934  0.327766               0.327766               0.327766                0.327766
927  0.753892               0.753892               0.753892                0.753892
997  0.148707               0.148707               0.148707                0.148707
167  0.730901               0.730901               0.730901                0.730901
914  0.841330               0.841330               0.841330                0.841330
432  0.897466               0.897466               0.897466                0.897466
587  0.411685               0.411685               0.411685                0.411685
884  0.378794               0.378794               0.378794                0.378794
379  0.265429               0.265429               0.265429                0.264843
\n", - "
" - ], - "text/plain": [ - " number number_copy_1_percent number_copy_5_percent \\\n", - "347 0.372389 0.372389 0.372389 \n", - "934 0.327766 0.327766 0.327766 \n", - "927 0.753892 0.753892 0.753892 \n", - "997 0.148707 0.148707 0.148707 \n", - "167 0.730901 0.730901 0.730901 \n", - "914 0.841330 0.841330 0.841330 \n", - "432 0.897466 0.897466 0.897466 \n", - "587 0.411685 0.411685 0.411685 \n", - "884 0.378794 0.378794 0.378794 \n", - "379 0.265429 0.265429 0.265429 \n", - "\n", - " number_copy_10_percent \n", - "347 0.372389 \n", - "934 0.327766 \n", - "927 0.753892 \n", - "997 0.148707 \n", - "167 0.730901 \n", - "914 0.841330 \n", - "432 0.897466 \n", - "587 0.411685 \n", - "884 0.378794 \n", - "379 0.264843 " - ] - }, - "execution_count": 115, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "imputed_number_df.sample(10)" - ] - }, - { - "cell_type": "markdown", - "id": "ddc79a45-bd2b-44f3-a3c4-aaefa73b43d9", - "metadata": {}, - "source": [ - "#### Check the % missing data in dataframe now" - ] - }, - { - "cell_type": "code", - "execution_count": 116, - "id": "5c98d450-bf5a-46e5-9091-c6a1202a2611", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "number number 0.0\n", - "number_copy_1_percent number_copy_1_percent 0.0\n", - "number_copy_5_percent number_copy_5_percent 0.0\n", - "number_copy_10_percent number_copy_10_percent 0.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(imputed_number_df))" - ] - }, - { - "cell_type": "markdown", - "id": "f14476bf-29e6-4d9a-9cd4-9dd56a53b466", - "metadata": {}, - "source": [ - "#### Store the list of differences between org. 
and Imputed value" - ] - }, - { - "cell_type": "code", - "execution_count": 117, - "id": "3f096800-dc6e-4455-a9e6-2db18884e5ee", - "metadata": {}, - "outputs": [], - "source": [ - "# create list of differences between imputed and original values\n", - "\n", - "number_diff_1 = []\n", - "number_diff_5 = []\n", - "number_diff_10 = []\n", - "count = 0\n", - "\n", - "for i in number_1_idx:\n", - " count +=1\n", - " diff1 = abs(imputed_number_df['number_copy_1_percent'][i] - df_number1['number'][i])\n", - " number_diff_1.append(diff1)\n", - " \n", - "\n", - "for i in number_5_idx:\n", - " diff5 = abs(imputed_number_df['number_copy_5_percent'][i] - df_number1['number'][i])\n", - " number_diff_5.append(diff5)\n", - "\n", - "for i in number_10_idx:\n", - " diff10 = abs(imputed_number_df['number_copy_10_percent'][i] - df_number1['number'][i])\n", - " number_diff_10.append(diff10)" - ] - }, - { - "cell_type": "code", - "execution_count": 118, - "id": "4a2c29fc-99f3-4624-808e-437d3983cabb", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "10\n", - "50\n", - "100\n" - ] - } - ], - "source": [ - "print(len(number_diff_1))\n", - "print(len(number_diff_5))\n", - "print(len(number_diff_10))" - ] - }, - { - "cell_type": "markdown", - "id": "4ec4adbe-5571-40e3-90ba-92cb431161ca", - "metadata": {}, - "source": [ - "### Calculate the mean and variance of the list of differences - KNN" - ] - }, - { - "cell_type": "code", - "execution_count": 119, - "id": "1163cb62-9dc4-427e-b5cf-20bf3e16d79b", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The mean of 1% is 0.0007902710470742466 and varience 1% is 4.5687016451605466e-07\n", - "The mean of 5% is 0.000675654857997236 and varience 5% is 3.072444468179742e-07\n", - "The mean of 10% is 0.000675654857997236 and varience 10% is 2.480608628449602e-07\n" - ] - } - ], - "source": [ - "m1 = sum(number_diff_1) / len(number_diff_1)\n", - "\n", - "# calculate 
variance using a list comprehension\n", - "var_res1 = sum((xi - m1) ** 2 for xi in number_diff_1) / len(number_diff_1)\n", - "\n", - "m5 = sum(number_diff_5) / len(number_diff_5)\n", - "\n", - "# calculate variance using a list comprehension\n", - "var_res5 = sum((xii - m5) ** 2 for xii in number_diff_5) / len(number_diff_5)\n", - "\n", - "\n", - "m10 = sum(number_diff_10) / len(number_diff_10)\n", - "\n", - "# calculate variance using a list comprehension\n", - "var_res10 = sum((xiii - m10) ** 2 for xiii in number_diff_10) / len(number_diff_10)\n", - "\n", - "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", - "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", - "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 120, - "id": "6987d059-7449-44a0-a3c2-8605362a18a0", - "metadata": {}, - "outputs": [], - "source": [ - "df_knn_number = pd.DataFrame.from_dict({'1%_number': [m1, var_res1],\n", - " '5%_number': [m5, var_res5],\n", - " '10%_number': [m10, var_res10]}, orient='index')\n", - "df_knn_number.columns=['diff. list Mean(KNN)', 'diff. 
list Var.(KNN)']" - ] - }, - { - "cell_type": "markdown", - "id": "41740e20-5dae-403e-a83b-94c91469fcc3", - "metadata": {}, - "source": [ - "### Perform MEAN based imputation" - ] - }, - { - "cell_type": "markdown", - "id": "17b69478-e97c-41b9-828a-eefbb46eb161", - "metadata": {}, - "source": [ - "#### Before mean imputation % missing" - ] - }, - { - "cell_type": "code", - "execution_count": 121, - "id": "5a828216-8f1a-4157-8141-77e6c929f57a", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "number number 0.0\n", - "number_copy_1_percent number_copy_1_percent 1.0\n", - "number_copy_5_percent number_copy_5_percent 5.0\n", - "number_copy_10_percent number_copy_10_percent 10.0\n" - ] - } - ], - "source": [ - "df_number2 = df_number.copy(deep=True)\n", - "print(get_percent_missing(df_number2))" - ] - }, - { - "cell_type": "code", - "execution_count": 122, - "id": "1e137676-9f01-44b9-8a84-50d03a89436b", - "metadata": {}, - "outputs": [], - "source": [ - "df_number2['number_copy_1_percent'] = df_number2['number_copy_1_percent'].fillna(df_number2['number_copy_1_percent'].mean())\n", - "df_number2['number_copy_5_percent'] = df_number2['number_copy_5_percent'].fillna(df_number2['number_copy_5_percent'].mean())\n", - "df_number2['number_copy_10_percent'] = df_number2['number_copy_10_percent'].fillna(df_number2['number_copy_10_percent'].mean())" - ] - }, - { - "cell_type": "markdown", - "id": "8da82021-d96a-46ac-81df-035977cb5497", - "metadata": {}, - "source": [ - "#### After mean impute % missing " - ] - }, - { - "cell_type": "code", - "execution_count": 123, - "id": "669c14bd-f920-47db-8476-1cd1b4f4f5bb", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "number number 0.0\n", - "number_copy_1_percent number_copy_1_percent 0.0\n", - "number_copy_5_percent number_copy_5_percent 0.0\n", - 
"number_copy_10_percent number_copy_10_percent 0.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_number2))" - ] - }, - { - "cell_type": "code", - "execution_count": 124, - "id": "ccb60d18-b24e-4211-9947-46ee0bcc06fe", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
       number  number_copy_1_percent  number_copy_5_percent  number_copy_10_percent
366  0.425525               0.425525               0.425525                0.425525
145  0.246589               0.246589               0.246589                0.246589
538  0.503701               0.503701               0.503701                0.503701
256  0.118901               0.118901               0.491932                0.118901
156  0.773215               0.773215               0.773215                0.773215
500  0.441087               0.441087               0.441087                0.441087
325  0.095068               0.095068               0.095068                0.095068
97   0.209842               0.209842               0.209842                0.487348
905  0.117657               0.491084               0.117657                0.117657
251  0.961305               0.961305               0.961305                0.961305
\n", - "
" - ], - "text/plain": [ - " number number_copy_1_percent number_copy_5_percent \\\n", - "366 0.425525 0.425525 0.425525 \n", - "145 0.246589 0.246589 0.246589 \n", - "538 0.503701 0.503701 0.503701 \n", - "256 0.118901 0.118901 0.491932 \n", - "156 0.773215 0.773215 0.773215 \n", - "500 0.441087 0.441087 0.441087 \n", - "325 0.095068 0.095068 0.095068 \n", - "97 0.209842 0.209842 0.209842 \n", - "905 0.117657 0.491084 0.117657 \n", - "251 0.961305 0.961305 0.961305 \n", - "\n", - " number_copy_10_percent \n", - "366 0.425525 \n", - "145 0.246589 \n", - "538 0.503701 \n", - "256 0.118901 \n", - "156 0.773215 \n", - "500 0.441087 \n", - "325 0.095068 \n", - "97 0.487348 \n", - "905 0.117657 \n", - "251 0.961305 " - ] - }, - "execution_count": 124, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_number2.sample(10)" - ] - }, - { - "cell_type": "markdown", - "id": "88d89795-0ae9-4f37-89cd-b24d36658588", - "metadata": {}, - "source": [ - "#### Create a list of difference - MEAN" - ] - }, - { - "cell_type": "code", - "execution_count": 125, - "id": "530979d5-52c4-473d-95f3-754c460a7ab6", - "metadata": {}, - "outputs": [], - "source": [ - "# create list of difference bwtween imputed and orginal value\n", - "\n", - "number_diff_1_mean = []\n", - "number_diff_5_mean = []\n", - "number_diff_10_mean = []\n", - "count = 0\n", - "\n", - "for i in number_1_idx:\n", - " count +=1\n", - " diff1 = abs(df_number2['number_copy_1_percent'][i] - df_number2['number'][i])\n", - " number_diff_1_mean.append(diff1)\n", - " \n", - "\n", - "for i in number_5_idx:\n", - " diff5 = abs(df_number2['number_copy_5_percent'][i] - df_number2['number'][i])\n", - " number_diff_5_mean.append(diff5)\n", - "\n", - "for i in number_10_idx:\n", - " diff10 = abs(df_number2['number_copy_10_percent'][i] - df_number2['number'][i])\n", - " number_diff_10_mean.append(diff10)" - ] - }, - { - "cell_type": "code", - "execution_count": 126, - "id": 
"28dd2494-0175-431e-b4b7-09ee4af1f6a0", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "10\n", - "50\n", - "100\n" - ] - } - ], - "source": [ - "print(len(number_diff_1_mean))\n", - "print(len(number_diff_5_mean))\n", - "print(len(number_diff_10_mean))" - ] - }, - { - "cell_type": "markdown", - "id": "4e90251e-4c0a-4e2d-82b1-8764374aed1c", - "metadata": {}, - "source": [ - "### Calculate the mean and var of the list of differences - MEAN Impute" - ] - }, - { - "cell_type": "code", - "execution_count": 127, - "id": "682bd76e-4875-4b4d-b90b-91d8a6e492ae", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The mean of 1% is 0.269368727544059 and varience 1% is 0.018130331928686818\n", - "The mean of 5% is 0.18484105170274112 and varience 5% is 0.014920933643125705\n", - "The mean of 10% is 0.18484105170274112 and varience 10% is 0.020023889816061954\n" - ] - } - ], - "source": [ - "m1 = sum(number_diff_1_mean) / len(number_diff_1_mean)\n", - "\n", - "# calculate variance using a list comprehension\n", - "var_res1 = sum((xi - m1) ** 2 for xi in number_diff_1_mean) / len(number_diff_1_mean)\n", - "\n", - "m5 = sum(number_diff_5_mean) / len(number_diff_5_mean)\n", - "\n", - "# calculate variance using a list comprehension\n", - "var_res5 = sum((xii - m5) ** 2 for xii in number_diff_5_mean) / len(number_diff_5_mean)\n", - "\n", - "\n", - "m10 = sum(number_diff_10_mean) / len(number_diff_10_mean)\n", - "\n", - "# calculate variance using a list comprehension\n", - "var_res10 = sum((xiii - m10) ** 2 for xiii in number_diff_10_mean) / len(number_diff_10_mean)\n", - "\n", - "print(f\"The mean of 1% is {m1} and varience 1% is {var_res1}\")\n", - "print(f\"The mean of 5% is {m5} and varience 5% is {var_res5}\")\n", - "print(f\"The mean of 10% is {m5} and varience 10% is {var_res10}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 128, - "id": 
"1f41880d-3e7d-48c9-8744-7e47ccae3c17", - "metadata": {}, - "outputs": [], - "source": [ - "df_MI_number = pd.DataFrame.from_dict({'1%_number': [m1, var_res1],\n", - " '5%_number': [m5, var_res5],\n", - " '10%_number': [m10, var_res10]}, orient='index')\n", - "df_MI_number.columns=['diff. list Mean(MI)', 'diff. list Var.(MI)']" - ] - }, - { - "cell_type": "markdown", - "id": "ec64b079-db97-429c-ae3a-519eec91db3f", - "metadata": {}, - "source": [ - "## KNN and MEAN columns side by side" - ] - }, - { - "cell_type": "code", - "execution_count": 129, - "id": "d74b0e73-e3f0-4107-806d-c5d5a50aab9a", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import display_html\n", - "from itertools import chain,cycle\n", - "def display_side_by_side(*args,titles=cycle([''])):\n", - " html_str=''\n", - " for df,title in zip(args, chain(titles,cycle(['
'])) ):\n", - " html_str+=''\n", - " html_str+=f'

{title}

'\n", - " html_str+=df.to_html().replace('table','table style=\"display:inline\"')\n", - " html_str+=''\n", - " display_html(html_str,raw=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 130, - "id": "747a487f-cbc4-467a-9bc7-b0856dbb6576", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 130, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from IPython.display import display, HTML\n", - "\n", - "CSS = \"\"\"\n", - ".output {\n", - " flex-direction: row;\n", - "}\n", - "\"\"\"\n", - "\n", - "HTML(''.format(CSS))" - ] - }, - { - "cell_type": "code", - "execution_count": 131, - "id": "d24551d1-cd58-4a41-8262-873fe5034272", - "metadata": {}, - "outputs": [], - "source": [ - "# https://github.com/epmoyer/ipy_table/issues/24\n", - "\n", - "from IPython.core.display import HTML\n", - "\n", - "def multi_table(table_list):\n", - " ''' Accepts a list of IpyTable objects and returns a table which contains each IpyTable in a cell\n", - " '''\n", - " return HTML(\n", - " '' + \n", - " ''.join(['' for table in table_list]) +\n", - " '
' + table._repr_html_() + '
'\n", - " )" - ] - }, - { - "cell_type": "code", - "execution_count": 132, - "id": "8a8daa30-3abf-4315-ae58-f9171ff000d5", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[124, 257, 309, 313, 405]\n" - ] - } - ], - "source": [ - "print(number_1_idx[:5])" - ] - }, - { - "cell_type": "code", - "execution_count": 133, - "id": "da6b1646-2417-42b7-bc8f-d3b0be85c61b", - "metadata": {}, - "outputs": [], - "source": [ - "compare_1 = imputed_number_df.loc[:, [\"number\", \"number_copy_1_percent\"]]\n", - "compare_5 = imputed_number_df.loc[:, [\"number\", \"number_copy_5_percent\"]]\n", - "compare_10 = imputed_number_df.loc[:, [\"number\", \"number_copy_10_percent\"]]" - ] - }, - { - "cell_type": "code", - "execution_count": 134, - "id": "380b94cf-264f-4a41-bb1d-ac272354073f", - "metadata": {}, - "outputs": [], - "source": [ - "compare_1_df = compare_1.iloc[number_1_idx]\n", - "compare_5_df = compare_5.iloc[number_5_idx]\n", - "compare_10_df = compare_10.iloc[number_10_idx]" - ] - }, - { - "cell_type": "code", - "execution_count": 135, - "id": "e5b21e71-0ddd-4c60-b931-b384d65230dd", - "metadata": {}, - "outputs": [], - "source": [ - "compare_1_mean = df_number2.loc[:, [\"number\", \"number_copy_1_percent\"]]\n", - "compare_5_mean = df_number2.loc[:, [\"number\", \"number_copy_5_percent\"]]\n", - "compare_10_mean = df_number2.loc[:, [\"number\", \"number_copy_10_percent\"]]" - ] - }, - { - "cell_type": "code", - "execution_count": 136, - "id": "29be3554-8129-4f0c-bad6-1270b7c6c05b", - "metadata": {}, - "outputs": [], - "source": [ - "compare_1_mean_df = compare_1_mean.iloc[number_1_idx]\n", - "compare_5_mean_df = compare_5_mean.iloc[number_5_idx]\n", - "compare_10_mean_df = compare_10_mean.iloc[number_10_idx]" - ] - }, - { - "cell_type": "code", - "execution_count": 137, - "id": "27b96ecc-3566-48f5-bec5-9b073c575cb6", - "metadata": {}, - "outputs": [], - "source": [ - "# display_side_by_side(compare_1_df.head(), 
compare_1_mean_df.head(), titles=['number 1% KNN Impute','number 1% Mean Impute'])\n", - "# display_side_by_side(compare_5_df.head(), compare_5_mean_df.head(), titles=['number 5% KNN Impute','number 5% Mean Impute'])\n", - "# display_side_by_side(compare_10_df.head(), compare_10_mean_df.head(), titles=['number 10% KNN Impute','number 10% Mean Impute'])" - ] - }, - { - "cell_type": "markdown", - "id": "72a3bc3c-0f91-49ad-bf03-dc4b7ace265d", - "metadata": {}, - "source": [ - "#### **number 1% KNN Impute VS number 1% Mean Impute**" - ] - }, - { - "cell_type": "code", - "execution_count": 138, - "id": "6fd11f89-9f4b-49b3-b114-1ab3b461f180", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
       number  number_copy_1_percent
124  0.192990               0.192926
257  0.065602               0.066172
309  0.661447               0.663769
313  0.963951               0.962988
405  0.627460               0.627545
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
       number  number_copy_1_percent
124  0.192990               0.491084
257  0.065602               0.491084
309  0.661447               0.491084
313  0.963951               0.491084
405  0.627460               0.491084
\n", - "
" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 138, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "multi_table([compare_1_df.head(), compare_1_mean_df.head()])" - ] - }, - { - "cell_type": "markdown", - "id": "e1fc9d1c-53ef-42d3-809b-d68051057e48", - "metadata": {}, - "source": [ - "#### **number 5% KNN Impute VS number 5% Mean Impute**" - ] - }, - { - "cell_type": "code", - "execution_count": 139, - "id": "a97c1530-2e50-48d2-a7e0-89fc70f648e5", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
       number  number_copy_5_percent
54   0.440144               0.439307
59   0.189655               0.191045
72   0.411451               0.412386
78   0.205178               0.204306
107  0.323097               0.322044
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
       number  number_copy_5_percent
54   0.440144               0.491932
59   0.189655               0.491932
72   0.411451               0.491932
78   0.205178               0.491932
107  0.323097               0.491932
\n", - "
" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 139, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "multi_table([compare_5_df.head(), compare_5_mean_df.head()])" - ] - }, - { - "cell_type": "markdown", - "id": "1e732ac9-faf7-4457-baef-ac9c4976598c", - "metadata": {}, - "source": [ - "#### **number 10% KNN Impute VS number 10% Mean Impute**" - ] - }, - { - "cell_type": "code", - "execution_count": 140, - "id": "f2d22e8f-5a0b-48c0-9150-a391d48e93b2", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
      number  number_copy_10_percent
22  0.798188                0.798777
47  0.861454                0.861385
49  0.445108                0.446055
68  0.557468                0.557299
69  0.231172                0.230069
\n", - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
      number  number_copy_10_percent
22  0.798188                0.487348
47  0.861454                0.487348
49  0.445108                0.487348
68  0.557468                0.487348
69  0.231172                0.487348
\n", - "
" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 140, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "multi_table([compare_10_df.head(), compare_10_mean_df.head()])" - ] - }, - { - "cell_type": "markdown", - "id": "cc817314-971f-4abf-a56e-9830a5cf0329", - "metadata": {}, - "source": [ - "# 1.2 Random Numbers dataset Results - KNN and MEAN" - ] - }, - { - "cell_type": "code", - "execution_count": 142, - "id": "1397844d-6757-471c-bd76-ff84d466b150", - "metadata": {}, - "outputs": [], - "source": [ - "results = pd.concat([df_knn_number, df_MI_number])" - ] - }, - { - "cell_type": "code", - "execution_count": 143, - "id": "51868cc7-20f3-499d-a76d-f06f99ea1841", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
            diff. list Mean(KNN)  diff. list Var.(KNN)  diff. list Mean(MI)  diff. list Var.(MI)
1%_number               0.000790          4.568702e-07                  NaN                  NaN
5%_number               0.000676          3.072444e-07                  NaN                  NaN
10%_number              0.000648          2.480609e-07                  NaN                  NaN
1%_number                    NaN                   NaN             0.269369             0.018130
5%_number                    NaN                   NaN             0.184841             0.014921
10%_number                   NaN                   NaN             0.231501             0.020024
\n", - "
" - ], - "text/plain": [ - " diff. list Mean(KNN) diff. list Var.(KNN) diff. list Mean(MI) \\\n", - "1%_number 0.000790 4.568702e-07 NaN \n", - "5%_number 0.000676 3.072444e-07 NaN \n", - "10%_number 0.000648 2.480609e-07 NaN \n", - "1%_number NaN NaN 0.269369 \n", - "5%_number NaN NaN 0.184841 \n", - "10%_number NaN NaN 0.231501 \n", - "\n", - " diff. list Var.(MI) \n", - "1%_number NaN \n", - "5%_number NaN \n", - "10%_number NaN \n", - "1%_number 0.018130 \n", - "5%_number 0.014921 \n", - "10%_number 0.020024 " - ] - }, - "execution_count": 143, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "results" - ] - }, - { - "cell_type": "code", - "execution_count": 144, - "id": "85deaebb-3a2b-4b52-bf80-ce31499a70d8", - "metadata": {}, - "outputs": [], - "source": [ - "results.to_csv('random_num_knn_mean_results.csv')" - ] - }, - { - "cell_type": "markdown", - "id": "08586561-e3a5-4d15-a1c0-b8d71731a84a", - "metadata": {}, - "source": [ - "# 2.1 Housing Dataset " - ] - }, - { - "cell_type": "code", - "execution_count": 361, - "id": "c05f4dd5-4cdc-4617-939a-2e22ec859af1", - "metadata": {}, - "outputs": [], - "source": [ - "housing_data = pd.read_csv('https://raw.githubusercontent.com/nikbearbrown/AI_Research_Group/main/Awesome-UCI-Datasets/Classification/House_Price_predication/train.csv')" - ] - }, - { - "cell_type": "code", - "execution_count": 362, - "id": "8564d163-97ce-44da-8d3c-6f8cd9c1d0a1", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", - "820 821 60 RL 72.0 7226 Pave NaN IR1 \n", - "1390 1391 20 RL 70.0 9100 Pave NaN Reg \n", - "535 536 190 RL 70.0 7000 Pave NaN Reg \n", - "1236 1237 160 RL 36.0 2628 Pave NaN Reg \n", - "1337 1338 30 RM 153.0 4118 Pave Grvl IR1 \n", - "674 675 20 RL 80.0 9200 Pave NaN Reg \n", - "604 605 20 RL 88.0 12803 Pave NaN IR1 \n", - "605 606 60 RL 85.0 13600 Pave NaN Reg \n", - "1218 1219 50 RM 52.0 6240 Pave NaN Reg \n", - "882 883 60 RL NaN 9636 Pave NaN IR1 \n", - "\n", - " LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal \\\n", - "820 Lvl AllPub ... 0 NaN NaN NaN 0 \n", - "1390 Lvl AllPub ... 0 NaN NaN NaN 0 \n", - "535 Lvl AllPub ... 0 NaN NaN NaN 0 \n", - "1236 Lvl AllPub ... 0 NaN NaN NaN 0 \n", - "1337 Bnk AllPub ... 0 NaN NaN NaN 0 \n", - "674 Lvl AllPub ... 0 NaN NaN NaN 0 \n", - "604 Lvl AllPub ... 0 NaN NaN NaN 0 \n", - "605 Lvl AllPub ... 0 NaN NaN NaN 0 \n", - "1218 Lvl AllPub ... 0 NaN NaN NaN 0 \n", - "882 Lvl AllPub ... 
0 NaN MnPrv NaN 0 \n", - "\n", - " MoSold YrSold SaleType SaleCondition SalePrice \n", - "820 6 2008 WD Normal 183000 \n", - "1390 9 2006 WD Normal 235000 \n", - "535 1 2008 WD Normal 107500 \n", - "1236 6 2010 WD Normal 175500 \n", - "1337 3 2006 WD Normal 52500 \n", - "674 7 2008 WD Normal 140000 \n", - "604 9 2008 WD Normal 221000 \n", - "605 10 2009 WD Normal 205000 \n", - "1218 7 2006 WD Normal 80500 \n", - "882 12 2009 WD Normal 178000 \n", - "\n", - "[10 rows x 81 columns]" - ] - }, - "execution_count": 362, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "housing_data.sample(10)" - ] - }, - { - "cell_type": "code", - "execution_count": 363, - "id": "bd81975c-0a21-414b-8e20-3564d35b9f9b", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "663" - ] - }, - "execution_count": 363, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "housing_data['SalePrice'].nunique()" - ] - }, - { - "cell_type": "code", - "execution_count": 364, - "id": "67d1046e-a1ad-412e-a7e8-a0d51729cec7", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "1073" - ] - }, - "execution_count": 364, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "housing_data['LotArea'].nunique()" - ] - }, - { - "cell_type": "code", - "execution_count": 365, - "id": "64b05e52-72dc-4f7d-aca3-d043036b4d2f", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "count 1460.000000\n", - "mean 180921.195890\n", - "std 79442.502883\n", - "min 34900.000000\n", - "25% 129975.000000\n", - "50% 163000.000000\n", - "75% 214000.000000\n", - "max 755000.000000\n", - "Name: SalePrice, dtype: float64" - ] - }, - "execution_count": 365, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "housing_data['SalePrice'].describe()" - ] - }, - { - "cell_type": "code", - "execution_count": 366, - "id": "b7e9928c-4785-4ee1-8150-cd0fa1ef3325", - "metadata": {}, - "outputs": [ 
- { - "data": { - "text/plain": [ - "count 1460.000000\n", - "mean 10516.828082\n", - "std 9981.264932\n", - "min 1300.000000\n", - "25% 7553.500000\n", - "50% 9478.500000\n", - "75% 11601.500000\n", - "max 215245.000000\n", - "Name: LotArea, dtype: float64" - ] - }, - "execution_count": 366, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "housing_data['LotArea'].describe()" - ] - }, - { - "cell_type": "code", - "execution_count": 367, - "id": "20149f80-07dc-4eaa-8d0e-7de6612a7dce", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "Id Id 0.000000\n", - "MSSubClass MSSubClass 0.000000\n", - "MSZoning MSZoning 0.000000\n", - "LotFrontage LotFrontage 17.739726\n", - "LotArea LotArea 0.000000\n", - "Street Street 0.000000\n", - "Alley Alley 93.767123\n", - "LotShape LotShape 0.000000\n", - "LandContour LandContour 0.000000\n", - "Utilities Utilities 0.000000\n", - "LotConfig LotConfig 0.000000\n", - "LandSlope LandSlope 0.000000\n", - "Neighborhood Neighborhood 0.000000\n", - "Condition1 Condition1 0.000000\n", - "Condition2 Condition2 0.000000\n", - "BldgType BldgType 0.000000\n", - "HouseStyle HouseStyle 0.000000\n", - "OverallQual OverallQual 0.000000\n", - "OverallCond OverallCond 0.000000\n", - "YearBuilt YearBuilt 0.000000\n", - "YearRemodAdd YearRemodAdd 0.000000\n", - "RoofStyle RoofStyle 0.000000\n", - "RoofMatl RoofMatl 0.000000\n", - "Exterior1st Exterior1st 0.000000\n", - "Exterior2nd Exterior2nd 0.000000\n", - "MasVnrType MasVnrType 0.547945\n", - "MasVnrArea MasVnrArea 0.547945\n", - "ExterQual ExterQual 0.000000\n", - "ExterCond ExterCond 0.000000\n", - "Foundation Foundation 0.000000\n", - "BsmtQual BsmtQual 2.534247\n", - "BsmtCond BsmtCond 2.534247\n", - "BsmtExposure BsmtExposure 2.602740\n", - "BsmtFinType1 BsmtFinType1 2.534247\n", - "BsmtFinSF1 BsmtFinSF1 0.000000\n", - "BsmtFinType2 BsmtFinType2 2.602740\n", - "BsmtFinSF2 BsmtFinSF2 
0.000000\n", - "BsmtUnfSF BsmtUnfSF 0.000000\n", - "TotalBsmtSF TotalBsmtSF 0.000000\n", - "Heating Heating 0.000000\n", - "HeatingQC HeatingQC 0.000000\n", - "CentralAir CentralAir 0.000000\n", - "Electrical Electrical 0.068493\n", - "1stFlrSF 1stFlrSF 0.000000\n", - "2ndFlrSF 2ndFlrSF 0.000000\n", - "LowQualFinSF LowQualFinSF 0.000000\n", - "GrLivArea GrLivArea 0.000000\n", - "BsmtFullBath BsmtFullBath 0.000000\n", - "BsmtHalfBath BsmtHalfBath 0.000000\n", - "FullBath FullBath 0.000000\n", - "HalfBath HalfBath 0.000000\n", - "BedroomAbvGr BedroomAbvGr 0.000000\n", - "KitchenAbvGr KitchenAbvGr 0.000000\n", - "KitchenQual KitchenQual 0.000000\n", - "TotRmsAbvGrd TotRmsAbvGrd 0.000000\n", - "Functional Functional 0.000000\n", - "Fireplaces Fireplaces 0.000000\n", - "FireplaceQu FireplaceQu 47.260274\n", - "GarageType GarageType 5.547945\n", - "GarageYrBlt GarageYrBlt 5.547945\n", - "GarageFinish GarageFinish 5.547945\n", - "GarageCars GarageCars 0.000000\n", - "GarageArea GarageArea 0.000000\n", - "GarageQual GarageQual 5.547945\n", - "GarageCond GarageCond 5.547945\n", - "PavedDrive PavedDrive 0.000000\n", - "WoodDeckSF WoodDeckSF 0.000000\n", - "OpenPorchSF OpenPorchSF 0.000000\n", - "EnclosedPorch EnclosedPorch 0.000000\n", - "3SsnPorch 3SsnPorch 0.000000\n", - "ScreenPorch ScreenPorch 0.000000\n", - "PoolArea PoolArea 0.000000\n", - "PoolQC PoolQC 99.520548\n", - "Fence Fence 80.753425\n", - "MiscFeature MiscFeature 96.301370\n", - "MiscVal MiscVal 0.000000\n", - "MoSold MoSold 0.000000\n", - "YrSold YrSold 0.000000\n", - "SaleType SaleType 0.000000\n", - "SaleCondition SaleCondition 0.000000\n", - "SalePrice SalePrice 0.000000\n" - ] - } - ], - "source": [ - "pd.set_option('display.max_rows', None)\n", - "print(get_percent_missing(housing_data))" - ] - }, - { - "cell_type": "markdown", - "id": "c8eb3ee3-085d-4b41-9a5f-c83a3805f870", - "metadata": {}, - "source": [ - "#### Using the SalePrice column for the KNN and MEAN imputation task" - ] - }, - { - "cell_type": 
"markdown", - "id": "451c79fb-17ba-40ac-8f0b-87a8b2ec4837", - "metadata": {}, - "source": [ - "#### Non Scaled dataframe Sale Price - take first 1000 rows" - ] - }, - { - "cell_type": "code", - "execution_count": 368, - "id": "9cc1f97f-1b24-4570-8f6a-30426bd79269", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", - "0 208500 208500 208500 208500\n", - "1 181500 181500 181500 181500\n", - "2 223500 223500 223500 223500\n", - "3 140000 140000 140000 140000\n", - "4 250000 250000 250000 250000" - ] - }, - "execution_count": 368, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_saleprice = housing_data[['SalePrice']][:1000]\n", - "df_saleprice['sp_copy_1_percent'] = df_saleprice[['SalePrice']]\n", - "df_saleprice['sp_copy_5_percent'] = df_saleprice[['SalePrice']]\n", - "df_saleprice['sp_copy_10_percent'] = df_saleprice[['SalePrice']]\n", - "df_saleprice.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 369, - "id": "f462f065-9f37-44f1-a22e-92e610dae2e9", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "1000" - ] - }, - "execution_count": 369, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "len(df_saleprice)" - ] - }, - { - "cell_type": "markdown", - "id": "03407bbd-f8a7-4f6c-a7c3-64a865ed3f7e", - "metadata": {}, - "source": [ - "#### Scaled Dataframe SalePrice - take first 1000 rows" - ] - }, - { - "cell_type": "code", - "execution_count": 370, - "id": "e461b1ef-df2c-410f-aea8-abe954fa9afd", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", - "0 0.241078 0.241078 0.241078 0.241078\n", - "1 0.203583 0.203583 0.203583 0.203583\n", - "2 0.261908 0.261908 0.261908 0.261908\n", - "3 0.145952 0.145952 0.145952 0.145952\n", - "4 0.298709 0.298709 0.298709 0.298709" - ] - }, - "execution_count": 370, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "scaler = MinMaxScaler()\n", - "df_saleprice_scaled = df_saleprice.copy(deep=True)\n", - "df_saleprice_scaled = pd.DataFrame(scaler.fit_transform(df_saleprice_scaled), columns = df_saleprice_scaled.columns)\n", - "df_saleprice_scaled.head()" - ] - }, - { - "cell_type": "markdown", - "id": "a66683c4-f66a-4aa1-ab8a-f28087b60b6c", - "metadata": {}, - "source": [ - "#### Check % missing values in this dataframe" - ] - }, - { - "cell_type": "code", - "execution_count": 371, - "id": "0075fa0f-4b82-4089-ab81-e5282497c4a3", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 0.0\n", - "sp_copy_5_percent sp_copy_5_percent 0.0\n", - "sp_copy_10_percent sp_copy_10_percent 0.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_saleprice))" - ] - }, - { - "cell_type": "markdown", - "id": "619ef99f-55c0-422c-aaa8-73cd71fcf2fb", - "metadata": {}, - "source": [ - "#### Create 1%, 5% and 10% missing data" - ] - }, - { - "cell_type": "code", - "execution_count": 372, - "id": "82df5098-4176-4fba-922f-ca84c0466f2a", - "metadata": {}, - "outputs": [], - "source": [ - "create_missing(df_saleprice, 0.01, 'sp_copy_1_percent')\n", - "create_missing(df_saleprice, 0.05, 'sp_copy_5_percent')\n", - "create_missing(df_saleprice, 0.1, 'sp_copy_10_percent')" - ] - }, - { - "cell_type": "code", - "execution_count": 373, - "id": "0e90ae04-cd10-4507-a851-c187010f0be0", - "metadata": {}, - "outputs": [], - "source": [ 
- "create_missing(df_saleprice_scaled, 0.01, 'sp_copy_1_percent')\n", - "create_missing(df_saleprice_scaled, 0.05, 'sp_copy_5_percent')\n", - "create_missing(df_saleprice_scaled, 0.1, 'sp_copy_10_percent')" - ] - }, - { - "cell_type": "markdown", - "id": "a8237a82-5a33-4ce9-b4c7-a48ede4f5fef", - "metadata": {}, - "source": [ - "#### Check missing values in the unscaled and scaled dataframes" - ] - }, - { - "cell_type": "code", - "execution_count": 374, - "id": "2794306d-89c7-4518-8979-9edb3d9441b1", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 1.0\n", - "sp_copy_5_percent sp_copy_5_percent 5.0\n", - "sp_copy_10_percent sp_copy_10_percent 10.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_saleprice))" - ] - }, - { - "cell_type": "code", - "execution_count": 375, - "id": "8351dbe2-b388-451d-9238-52c4ccabd425", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 1.0\n", - "sp_copy_5_percent sp_copy_5_percent 5.0\n", - "sp_copy_10_percent sp_copy_10_percent 10.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_saleprice_scaled))" - ] - }, - { - "cell_type": "code", - "execution_count": 376, - "id": "b11b093f-110b-4ef3-9d00-ac4fed45a956", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "10" - ] - }, - "execution_count": 376, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_saleprice['sp_copy_1_percent'].isna().sum()" - ] - }, - { - "cell_type": "markdown", - "id": "360e0010-e085-435c-8902-80c6a7ea78be", - "metadata": {}, - "source": [ - "#### Store indices of missing values" - ] - }, - { - "cell_type": "code", - "execution_count": 377, - "id": "e546096c-ce35-448e-aa97-0943d3535a87", - 
"metadata": {}, - "outputs": [], - "source": [ - "# Store the row positions of the NaN values in each column\n", - "sp_1_idx = list(np.where(df_saleprice['sp_copy_1_percent'].isna())[0])\n", - "sp_5_idx = list(np.where(df_saleprice['sp_copy_5_percent'].isna())[0])\n", - "sp_10_idx = list(np.where(df_saleprice['sp_copy_10_percent'].isna())[0])" - ] - }, - { - "cell_type": "code", - "execution_count": 378, - "id": "d409e2a5-b3a9-4ae1-9b17-88b7c642692d", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "10\n", - "50\n", - "100\n" - ] - } - ], - "source": [ - "print(len(sp_1_idx))\n", - "print(len(sp_5_idx))\n", - "print(len(sp_10_idx))" - ] - }, - { - "cell_type": "code", - "execution_count": 379, - "id": "5839460a-e736-42e9-9a13-d5bab5683115", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Length of sp_1_idx is 10 and it contains 1.0% of total data in column | Total rows: 1000\n", - "Length of sp_5_idx is 50 and it contains 5.0% of total data in column | Total rows: 1000\n", - "Length of sp_10_idx is 100 and it contains 10.0% of total data in column | Total rows: 1000\n" - ] - } - ], - "source": [ - "print(f\"Length of sp_1_idx is {len(sp_1_idx)} and it contains {(len(sp_1_idx)/len(df_saleprice['sp_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_1_percent'])}\")\n", - "print(f\"Length of sp_5_idx is {len(sp_5_idx)} and it contains {(len(sp_5_idx)/len(df_saleprice['sp_copy_5_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_5_percent'])}\")\n", - "print(f\"Length of sp_10_idx is {len(sp_10_idx)} and it contains {(len(sp_10_idx)/len(df_saleprice['sp_copy_10_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_10_percent'])}\")" - ] - }, - { - "cell_type": "markdown", - "id": "c1464c79-c0a9-4640-92dd-f0d5131634ab", - "metadata": {}, - "source": [ - "### Apply KNN imputation 
to the df_saleprice 
and df_saleprice_scaled dataframes" - ] - }, - { - "cell_type": "code", - "execution_count": 380, - "id": "08fa2436-ffb8-4b5d-a7a1-9e2d63b14562", - "metadata": {}, - "outputs": [], - "source": [ - "df_saleprice1 = df_saleprice.copy(deep=True)\n", - "imputer = KNNImputer(n_neighbors=5)\n", - "imputed_saleprice_df = pd.DataFrame(imputer.fit_transform(df_saleprice1), columns = df_saleprice1.columns)" - ] - }, - { - "cell_type": "code", - "execution_count": 381, - "id": "205c7a96-3f1c-42a4-91de-f22f15ce9cb2", - "metadata": {}, - "outputs": [], - "source": [ - "df_saleprice_scaled1 = df_saleprice_scaled.copy(deep=True)\n", - "imputer = KNNImputer(n_neighbors=5)\n", - "imputed_saleprice_scaled_df = pd.DataFrame(imputer.fit_transform(df_saleprice_scaled1), columns = df_saleprice_scaled1.columns)" - ] - }, - { - "cell_type": "code", - "execution_count": 382, - "id": "a482f58d-73b6-423c-b97a-140884830a0f", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", - "0 208500.0 208500.0 208500.0 208500.0\n", - "1 181500.0 181500.0 181500.0 181500.0\n", - "2 223500.0 223500.0 223500.0 223500.0\n", - "3 140000.0 140000.0 140000.0 140000.0\n", - "4 250000.0 250000.0 250000.0 250000.0" - ] - }, - "execution_count": 382, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "imputed_saleprice_df.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 383, - "id": "11f8f5ff-f06d-4ec2-a4e3-1324e807a537", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", - "0 0.241078 0.241078 0.240855 0.241078\n", - "1 0.203583 0.203583 0.203583 0.203583\n", - "2 0.261908 0.261908 0.261908 0.261908\n", - "3 0.145952 0.145952 0.145952 0.145952\n", - "4 0.298709 0.298709 0.298709 0.298709" - ] - }, - "execution_count": 383, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "imputed_saleprice_scaled_df.head()" - ] - }, - { - "cell_type": "markdown", - "id": "d9fd7fa1-4ce0-43be-9955-55ef759d930b", - "metadata": {}, - "source": [ - "#### Check % missing in saleprice and saleprice_scaled DF" - ] - }, - { - "cell_type": "code", - "execution_count": 384, - "id": "9ed0d36a-9584-4e3b-9201-2ac36827bce9", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 0.0\n", - "sp_copy_5_percent sp_copy_5_percent 0.0\n", - "sp_copy_10_percent sp_copy_10_percent 0.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(imputed_saleprice_df))" - ] - }, - { - "cell_type": "code", - "execution_count": 385, - "id": "7c842fce-bbd5-4c2c-bb1a-db5df92f6315", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 0.0\n", - "sp_copy_5_percent sp_copy_5_percent 0.0\n", - "sp_copy_10_percent sp_copy_10_percent 0.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(imputed_saleprice_scaled_df))" - ] - }, - { - "cell_type": "markdown", - "id": "ac47abb1-df5f-4686-bc67-6617140c008c", - "metadata": {}, - "source": [ - "#### Store the list of differences between original 
and imputed values" - ] - }, - { - "cell_type": "code", - "execution_count": 386, - "id": "99e04554-568d-4efa-a110-768b50dfaee6", - "metadata": {}, - "outputs": [], - "source": [ - "# create lists of absolute differences between imputed and original values\n", - "\n", - "sp_diff_1 = []\n", - "sp_diff_5 = []\n", - "sp_diff_10 = []\n", - "\n", - "for i in sp_1_idx:\n", - " diff1 = abs(imputed_saleprice_df['sp_copy_1_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", - " sp_diff_1.append(diff1)\n", - " \n", - "\n", - "for i in sp_5_idx:\n", - " diff5 = abs(imputed_saleprice_df['sp_copy_5_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", - " sp_diff_5.append(diff5)\n", - "\n", - "for i in sp_10_idx:\n", - " diff10 = abs(imputed_saleprice_df['sp_copy_10_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", - " sp_diff_10.append(diff10)" - ] - }, - { - "cell_type": "code", - "execution_count": 387, - "id": "92204f8a-497c-470d-a770-59165d226cc9", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "10\n", - "50\n", - "100\n" - ] - } - ], - "source": [ - "print(len(sp_diff_1))\n", - "print(len(sp_diff_5))\n", - "print(len(sp_diff_10))" - ] - }, - { - "cell_type": "code", - "execution_count": 388, - "id": "b8875fff-0289-4dd9-92c1-78dc9b730d22", - "metadata": {}, - "outputs": [], - "source": [ - "# create lists of absolute differences between imputed and original values (scaled)\n", - "# NOTE: sp_1_idx, sp_5_idx and sp_10_idx were computed from df_saleprice. create_missing was called\n", - "# separately on df_saleprice_scaled, so these positions may not be where the scaled columns had NaNs.\n", - "\n", - "sp_scaled_diff_1 = []\n", - "sp_scaled_diff_5 = []\n", - "sp_scaled_diff_10 = []\n", - "\n", - "for i in sp_1_idx:\n", - " diff1 = abs(imputed_saleprice_scaled_df['sp_copy_1_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", - " sp_scaled_diff_1.append(diff1)\n", - " \n", - "\n", - "for i in sp_5_idx:\n", 
- " diff5 = abs(imputed_saleprice_scaled_df['sp_copy_5_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", - " sp_scaled_diff_5.append(diff5)\n", - "\n", - "for i 
in sp_10_idx:\n", - " diff10 = abs(imputed_saleprice_scaled_df['sp_copy_10_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", - " sp_scaled_diff_10.append(diff10)" - ] - }, - { - "cell_type": "code", - "execution_count": 389, - "id": "40192344-79a4-444c-a12a-2201dc5aa0c1", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "10\n", - "50\n", - "100\n" - ] - } - ], - "source": [ - "print(len(sp_scaled_diff_1))\n", - "print(len(sp_scaled_diff_5))\n", - "print(len(sp_scaled_diff_10))" - ] - }, - { - "cell_type": "code", - "execution_count": 390, - "id": "a95bd45c-8a2f-4159-8306-399ec18a4c0f", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[0.0, 0.0, 0.0, 0.0, 0.0]" - ] - }, - "execution_count": 390, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "sp_scaled_diff_1[:5]" - ] - }, - { - "cell_type": "code", - "execution_count": 391, - "id": "0f73d420-8842-4062-ae17-158a0a25e169", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[10.0, 20.0, 80.0, 220.0, 0.0]" - ] - }, - "execution_count": 391, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "sp_diff_1[:5]" - ] - }, - { - "cell_type": "markdown", - "id": "a40fd400-913b-4011-b0b9-dd3ca0d5827a", - "metadata": {}, - "source": [ - "#### Calculate the mean and var of list of diff. 
KNN - SalePrice" - ] - }, - { - "cell_type": "code", - "execution_count": 392, - "id": "80267827-7f73-49ff-b200-27cdb2963756", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The mean of 1% is 170.0 and varience 1% is 42400.0\n", - "The mean of 5% is 444.9439999999997 and varience 5% is 2554554.1584639903\n", - "The mean of 10% is 444.9439999999997 and varience 10% is 6304766.8341439795\n" - ] - } - ], - "source": [ - "m1 = sum(sp_diff_1) / len(sp_diff_1)\n", - "\n", - "# population variance (divide by n), computed with a generator expression\n", - "var_res1 = sum((xi - m1) ** 2 for xi in sp_diff_1) / len(sp_diff_1)\n", - "\n", - "m5 = sum(sp_diff_5) / len(sp_diff_5)\n", - "\n", - "# population variance (divide by n), computed with a generator expression\n", - "var_res5 = sum((xii - m5) ** 2 for xii in sp_diff_5) / len(sp_diff_5)\n", - "\n", - "\n", - "m10 = sum(sp_diff_10) / len(sp_diff_10)\n", - "\n", - "# population variance (divide by n), computed with a generator expression\n", - "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_diff_10) / len(sp_diff_10)\n", - "\n", - "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", - "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", - "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 393, - "id": "358545ff-2fcf-4c99-9049-4eaf6dd110bd", - "metadata": {}, - "outputs": [], - "source": [ - "df_knn_saleprice = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", - " '5%_saleprice': [m5, var_res5],\n", - " '10%_saleprice': [m10, var_res10]}, orient='index')\n", - "df_knn_saleprice.columns=['diff. list Mean(KNN)', 'diff. list Var.(KNN)']" - ] - }, - { - "cell_type": "code", - "execution_count": 394, - "id": "3714c8f9-58db-40a7-b5a2-6bb7e788b734", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " diff. list Mean(KNN) diff. list Var.(KNN)\n", - "1%_saleprice 170.000 4.240000e+04\n", - "5%_saleprice 444.944 2.554554e+06\n", - "10%_saleprice 564.784 6.304767e+06" - ] - }, - "execution_count": 394, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_knn_saleprice" - ] - }, - { - "cell_type": "markdown", - "id": "fd7608a8-c5fb-425c-a340-af01801ee349", - "metadata": {}, - "source": [ - "#### Calculate the mean and var of list of diff. KNN - SalePrice scaled" - ] - }, - { - "cell_type": "code", - "execution_count": 395, - "id": "bb03017f-3d91-48d9-8ebf-7cb5c25fadc3", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The mean of 1% is 0.0 and varience 1% is 0.0\n", - "The mean of 5% is 2.6301902513541363e-05 and varience 5% is 2.134349753649814e-08\n", - "The mean of 10% is 2.6301902513541363e-05 and varience 10% is 1.417383473391258e-08\n" - ] - } - ], - "source": [ - "m1 = sum(sp_scaled_diff_1) / len(sp_scaled_diff_1)\n", - "\n", - "# population variance (divide by n), computed with a generator expression\n", - "var_res1 = sum((xi - m1) ** 2 for xi in sp_scaled_diff_1) / len(sp_scaled_diff_1)\n", - "\n", - "m5 = sum(sp_scaled_diff_5) / len(sp_scaled_diff_5)\n", - "\n", - "# population variance (divide by n), computed with a generator expression\n", - "var_res5 = sum((xii - m5) ** 2 for xii in sp_scaled_diff_5) / len(sp_scaled_diff_5)\n", - "\n", - "\n", - "m10 = sum(sp_scaled_diff_10) / len(sp_scaled_diff_10)\n", - "\n", - "# population variance (divide by n), computed with a generator expression\n", - "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_scaled_diff_10) / len(sp_scaled_diff_10)\n", - "\n", - "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", - "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", - "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" - ] - }, - { - "cell_type": "code", - 
"execution_count": 396, - "id": "290d8db2-c9f4-4028-ab44-ad68c9e7b3c5", - 
"metadata": {}, - "outputs": [], - "source": [ - "df_knn_saleprice_scaled = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", - " '5%_saleprice': [m5, var_res5],\n", - " '10%_saleprice': [m10, var_res10]}, orient='index')\n", - "df_knn_saleprice_scaled.columns=['diff. list Mean(KNN) scaled', 'diff. list Var.(KNN) scaled']" - ] - }, - { - "cell_type": "code", - "execution_count": 397, - "id": "89347fd7-d87d-42bb-b375-a75417c395de", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " diff. list Mean(KNN) scaled diff. list Var.(KNN) scaled\n", - "1%_saleprice 0.000000 0.000000e+00\n", - "5%_saleprice 0.000026 2.134350e-08\n", - "10%_saleprice 0.000032 1.417383e-08" - ] - }, - "execution_count": 397, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_knn_saleprice_scaled" - ] - }, - { - "cell_type": "markdown", - "id": "c984dc69-f85f-4f1b-8c94-4afb48c1c8db", - "metadata": {}, - "source": [ - "### Perform MEAN imputation" - ] - }, - { - "cell_type": "code", - "execution_count": 398, - "id": "008bc14f-45e7-42d8-b843-2fee7bcf26c2", - "metadata": {}, - "outputs": [], - "source": [ - "df_saleprice2 = df_saleprice.copy(deep=True)\n", - "df_saleprice_scaled2 = df_saleprice_scaled.copy(deep=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 399, - "id": "bd71dc1a-f137-46ed-bf2b-f3d87fd4b6a0", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 1.0\n", - "sp_copy_5_percent sp_copy_5_percent 5.0\n", - "sp_copy_10_percent sp_copy_10_percent 10.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_saleprice2))" - ] - }, - { - "cell_type": "code", - "execution_count": 400, - "id": "46237cfd-6361-466f-b66f-32f5940149d6", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 1.0\n", - "sp_copy_5_percent sp_copy_5_percent 5.0\n", - "sp_copy_10_percent sp_copy_10_percent 10.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_saleprice_scaled2))" - ] - }, - { - "cell_type": "markdown", - "id": "64465299-5620-47b9-a28d-afb5494f279e", - "metadata": {}, - "source": [ - "#### Impute Mean values in missing for saleprice and saleprice_scaled" - ] - }, - { - "cell_type": "code", - 
"execution_count": 401, - "id": "28cf6b75-eebf-4758-94ec-4b3536f2c659", - "metadata": {}, - "outputs": [], - "source": [ - "df_saleprice2['sp_copy_1_percent'] = df_saleprice2['sp_copy_1_percent'].fillna(df_saleprice2['sp_copy_1_percent'].mean())\n", - "df_saleprice2['sp_copy_5_percent'] = df_saleprice2['sp_copy_5_percent'].fillna(df_saleprice2['sp_copy_5_percent'].mean())\n", - "df_saleprice2['sp_copy_10_percent'] = df_saleprice2['sp_copy_10_percent'].fillna(df_saleprice2['sp_copy_10_percent'].mean())" - ] - }, - { - "cell_type": "code", - "execution_count": 402, - "id": "2409dd8c-3cd0-4742-b0ac-14dea1fdb504", - "metadata": {}, - "outputs": [], - "source": [ - "df_saleprice_scaled2['sp_copy_1_percent'] = df_saleprice_scaled2['sp_copy_1_percent'].fillna(df_saleprice_scaled2['sp_copy_1_percent'].mean())\n", - "df_saleprice_scaled2['sp_copy_5_percent'] = df_saleprice_scaled2['sp_copy_5_percent'].fillna(df_saleprice_scaled2['sp_copy_5_percent'].mean())\n", - "df_saleprice_scaled2['sp_copy_10_percent'] = df_saleprice_scaled2['sp_copy_10_percent'].fillna(df_saleprice_scaled2['sp_copy_10_percent'].mean())" - ] - }, - { - "cell_type": "markdown", - "id": "62377754-b682-45e5-8faa-1a4a186bd3c7", - "metadata": {}, - "source": [ - "#### After MEAN imputation - Saleprice and saleprice scaled" - ] - }, - { - "cell_type": "code", - "execution_count": 403, - "id": "6c448556-55f4-4685-aed2-6b67d5ad8a2a", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 0.0\n", - "sp_copy_5_percent sp_copy_5_percent 0.0\n", - "sp_copy_10_percent sp_copy_10_percent 0.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_saleprice2))" - ] - }, - { - "cell_type": "code", - "execution_count": 404, - "id": "d9775fbf-7a72-4352-b446-488e9d25b6a2", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - " 
column_name percent_missing\n", - "SalePrice SalePrice 0.0\n", - "sp_copy_1_percent sp_copy_1_percent 0.0\n", - "sp_copy_5_percent sp_copy_5_percent 0.0\n", - "sp_copy_10_percent sp_copy_10_percent 0.0\n" - ] - } - ], - "source": [ - "print(get_percent_missing(df_saleprice_scaled2))" - ] - }, - { - "cell_type": "code", - "execution_count": 407, - "id": "136f87e6-a4af-4229-b36a-695f712deee5", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
SalePricesp_copy_1_percentsp_copy_5_percentsp_copy_10_percent
571120000120000.0120000.000000182343.817778
2223500223500.0223500.000000223500.000000
313375000375000.0375000.000000375000.000000
377340000340000.0182457.342105182343.817778
987395192395192.0395192.000000395192.000000
\n", - "
" - ], - "text/plain": [ - " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", - "571 120000 120000.0 120000.000000 182343.817778\n", - "2 223500 223500.0 223500.000000 223500.000000\n", - "313 375000 375000.0 375000.000000 375000.000000\n", - "377 340000 340000.0 182457.342105 182343.817778\n", - "987 395192 395192.0 395192.000000 395192.000000" - ] - }, - "execution_count": 407, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_saleprice2.sample(5)" - ] - }, - { - "cell_type": "code", - "execution_count": 409, - "id": "784cb61c-78f8-4b31-b709-379c50024dca", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
SalePricesp_copy_1_percentsp_copy_5_percentsp_copy_10_percent
2160.2431610.2431610.2431610.243161
10.2035830.2035830.2035830.203583
5750.1160950.1160950.1160950.116095
3970.1869180.1869180.1869180.205253
7030.1459520.1459520.1459520.145952
\n", - "
" - ], - "text/plain": [ - " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", - "216 0.243161 0.243161 0.243161 0.243161\n", - "1 0.203583 0.203583 0.203583 0.203583\n", - "575 0.116095 0.116095 0.116095 0.116095\n", - "397 0.186918 0.186918 0.186918 0.205253\n", - "703 0.145952 0.145952 0.145952 0.145952" - ] - }, - "execution_count": 409, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_saleprice_scaled2.sample(5)" - ] - }, - { - "cell_type": "markdown", - "id": "33c1f3b7-5afc-45cb-8b43-9682ec87156d", - "metadata": {}, - "source": [ - "#### Create List of differences for saleprice and saleprice_scaled Dataframes" - ] - }, - { - "cell_type": "code", - "execution_count": 410, - "id": "d2faf410-f83e-4ccb-89d4-e6f8c7adffbb", - "metadata": {}, - "outputs": [], - "source": [ - "# create list of difference bwtween imputed and orginal value\n", - "\n", - "sp_mean_diff_1 = []\n", - "sp_mean_diff_5 = []\n", - "sp_mean_diff_10 = []\n", - "count = 0\n", - "\n", - "for i in sp_1_idx:\n", - " count +=1\n", - " diff1 = abs(df_saleprice2['sp_copy_1_percent'][i] - df_saleprice2['SalePrice'][i])\n", - " sp_mean_diff_1.append(diff1)\n", - " \n", - "\n", - "for i in sp_5_idx:\n", - " diff5 = abs(df_saleprice2['sp_copy_5_percent'][i] - df_saleprice2['SalePrice'][i])\n", - " sp_mean_diff_5.append(diff5)\n", - "\n", - "for i in sp_10_idx:\n", - " diff10 = abs(df_saleprice2['sp_copy_10_percent'][i] - df_saleprice2['SalePrice'][i])\n", - " sp_mean_diff_10.append(diff10)" - ] - }, - { - "cell_type": "code", - "execution_count": 411, - "id": "789b07c5-530a-4111-8c97-f5297f7da5e4", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "10\n", - "50\n", - "100\n" - ] - } - ], - "source": [ - "print(len(sp_mean_diff_1))\n", - "print(len(sp_mean_diff_5))\n", - "print(len(sp_mean_diff_10))" - ] - }, - { - "cell_type": "code", - "execution_count": 412, - "id": "4fec222c-2420-41af-9e2a-d9773e1d6259", - 
"metadata": {}, - "outputs": [], - "source": [ - "# create list of difference bwtween imputed and orginal value\n", - "\n", - "sp_scaled_mean_diff_1 = []\n", - "sp_scaled_mean_diff_5 = []\n", - "sp_scaled_mean_diff_10 = []\n", - "count = 0\n", - "\n", - "for i in sp_1_idx:\n", - " count +=1\n", - " diff1 = abs(df_saleprice_scaled2['sp_copy_1_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", - " sp_scaled_mean_diff_1.append(diff1)\n", - " \n", - "\n", - "for i in sp_5_idx:\n", - " diff5 = abs(df_saleprice_scaled2['sp_copy_5_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", - " sp_scaled_mean_diff_5.append(diff5)\n", - "\n", - "for i in sp_10_idx:\n", - " diff10 = abs(df_saleprice_scaled2['sp_copy_10_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", - " sp_scaled_mean_diff_10.append(diff10)" - ] - }, - { - "cell_type": "code", - "execution_count": 413, - "id": "de9bf1de-68fe-4894-915a-7069b386123f", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "10\n", - "50\n", - "100\n" - ] - } - ], - "source": [ - "print(len(sp_scaled_mean_diff_1))\n", - "print(len(sp_scaled_mean_diff_5))\n", - "print(len(sp_scaled_mean_diff_10))" - ] - }, - { - "cell_type": "markdown", - "id": "f7b93757-d1a7-41a1-85fa-3ee77734be5b", - "metadata": {}, - "source": [ - "#### Calculate mean and var of list of diff. 
- MEAN impute SalePrice" - ] - }, - { - "cell_type": "code", - "execution_count": 414, - "id": "c60d3aad-33f0-48f4-8bb0-f8af45e33e1e", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The mean of 1% is 55971.63676767676 and varience 1% is 1103367192.190047\n", - "The mean of 5% is 58478.24210526314 and varience 5% is 3139731297.2794733\n", - "The mean of 10% is 58478.24210526314 and varience 10% is 3846674638.263318\n" - ] - } - ], - "source": [ - "m1 = sum(sp_mean_diff_1) / len(sp_mean_diff_1)\n", - "\n", - "# calculate variance using a list comprehension\n", - "var_res1 = sum((xi - m1) ** 2 for xi in sp_mean_diff_1) / len(sp_mean_diff_1)\n", - "\n", - "m5 = sum(sp_mean_diff_5) / len(sp_mean_diff_5)\n", - "\n", - "# calculate variance using a list comprehension\n", - "var_res5 = sum((xii - m5) ** 2 for xii in sp_mean_diff_5) / len(sp_mean_diff_5)\n", - "\n", - "\n", - "m10 = sum(sp_mean_diff_10) / len(sp_mean_diff_10)\n", - "\n", - "# calculate variance using a list comprehension\n", - "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_mean_diff_10) / len(sp_mean_diff_10)\n", - "\n", - "print(f\"The mean of 1% is {m1} and varience 1% is {var_res1}\")\n", - "print(f\"The mean of 5% is {m5} and varience 5% is {var_res5}\")\n", - "print(f\"The mean of 10% is {m5} and varience 10% is {var_res10}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 415, - "id": "e7f6e5cf-4eaa-4bfe-add2-fc7f600941b7", - "metadata": {}, - "outputs": [], - "source": [ - "df_mean_saleprice = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", - " '5%_saleprice': [m5, var_res5],\n", - " '10%_saleprice': [m10, var_res10]}, orient='index')\n", - "df_mean_saleprice.columns=['diff. list Mean(MI)', 'diff. list Var.(MI)']" - ] - }, - { - "cell_type": "code", - "execution_count": 416, - "id": "cc37eeaf-e3cd-4a83-870d-fab7037eeffe", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
diff. list Mean(MI)diff. list Var.(MI)
1%_saleprice55971.6367681.103367e+09
5%_saleprice58478.2421053.139731e+09
10%_saleprice61028.7099113.846675e+09
\n", - "
" - ], - "text/plain": [ - " diff. list Mean(MI) diff. list Var.(MI)\n", - "1%_saleprice 55971.636768 1.103367e+09\n", - "5%_saleprice 58478.242105 3.139731e+09\n", - "10%_saleprice 61028.709911 3.846675e+09" - ] - }, - "execution_count": 416, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_mean_saleprice" - ] - }, - { - "cell_type": "markdown", - "id": "f405f073-1b45-47e8-873b-7a9d34ad0e5c", - "metadata": {}, - "source": [ - "#### Calculate mean and var of list of diff. - MEAN impute SalePrice scaled" - ] - }, - { - "cell_type": "code", - "execution_count": 417, - "id": "2516b4f7-6b79-4636-9bd5-0738343ea355", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The mean of 1% is 0.0 and varience 1% is 0.0\n", - "The mean of 5% is 0.00893610697344667 and varience 5% is 0.0014044730755095036\n", - "The mean of 10% is 0.00893610697344667 and varience 10% is 0.0004431848362889144\n" - ] - } - ], - "source": [ - "m1 = sum(sp_scaled_mean_diff_1) / len(sp_scaled_mean_diff_1)\n", - "\n", - "# calculate variance using a list comprehension\n", - "var_res1 = sum((xi - m1) ** 2 for xi in sp_scaled_mean_diff_1) / len(sp_scaled_mean_diff_1)\n", - "\n", - "m5 = sum(sp_scaled_mean_diff_5) / len(sp_scaled_mean_diff_5)\n", - "\n", - "# calculate variance using a list comprehension\n", - "var_res5 = sum((xii - m5) ** 2 for xii in sp_scaled_mean_diff_5) / len(sp_scaled_mean_diff_5)\n", - "\n", - "\n", - "m10 = sum(sp_scaled_mean_diff_10) / len(sp_scaled_mean_diff_10)\n", - "\n", - "# calculate variance using a list comprehension\n", - "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_scaled_mean_diff_10) / len(sp_scaled_mean_diff_10)\n", - "\n", - "print(f\"The mean of 1% is {m1} and varience 1% is {var_res1}\")\n", - "print(f\"The mean of 5% is {m5} and varience 5% is {var_res5}\")\n", - "print(f\"The mean of 10% is {m5} and varience 10% is {var_res10}\")" - ] - }, - { - "cell_type": "code", - 
"execution_count": 418, - "id": "fe6a93b8-d6cb-4d7d-856b-ab4ee8fe78fc", - "metadata": {}, - "outputs": [], - "source": [ - "df_mean_saleprice_scaled = pd.DataFrame.from_dict({'1%_saleprice_scaled': [m1, var_res1],\n", - " '5%_saleprice_scaled': [m5, var_res5],\n", - " '10%_saleprice_scaled': [m10, var_res10]}, orient='index')\n", - "df_mean_saleprice_scaled.columns=['diff. list Mean(MI) scaled', 'diff. list Var.(MI) scaled']" - ] - }, - { - "cell_type": "code", - "execution_count": 419, - "id": "e74c35ed-7c2d-44ab-b6c2-4d81c2c6b6bb", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
diff. list Mean(MI) scaleddiff. list Var.(MI) scaled
1%_saleprice_scaled0.0000000.000000
5%_saleprice_scaled0.0089360.001404
10%_saleprice_scaled0.0074920.000443
\n", - "
" - ], - "text/plain": [ - " diff. list Mean(MI) scaled diff. list Var.(MI) scaled\n", - "1%_saleprice_scaled 0.000000 0.000000\n", - "5%_saleprice_scaled 0.008936 0.001404\n", - "10%_saleprice_scaled 0.007492 0.000443" - ] - }, - "execution_count": 419, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_mean_saleprice_scaled" - ] - }, - { - "cell_type": "markdown", - "id": "876b979a-f5c4-43a7-9ead-d5d866bef078", - "metadata": {}, - "source": [ - "# 2.2 Housing Data Results - KNN and MEAN" - ] - }, - { - "cell_type": "code", - "execution_count": 420, - "id": "fea4b521-03a3-46ce-b217-27225eb868af", - "metadata": {}, - "outputs": [], - "source": [ - "results1 = pd.concat([df_knn_saleprice, df_knn_saleprice_scaled, df_mean_saleprice, df_mean_saleprice_scaled])" - ] - }, - { - "cell_type": "code", - "execution_count": 421, - "id": "631729d6-e853-4ba5-b5fd-4e632ec00d5f", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
diff. list Mean(KNN)diff. list Var.(KNN)diff. list Mean(KNN) scaleddiff. list Var.(KNN) scaleddiff. list Mean(MI)diff. list Var.(MI)diff. list Mean(MI) scaleddiff. list Var.(MI) scaled
1%_saleprice170.0004.240000e+04NaNNaNNaNNaNNaNNaN
5%_saleprice444.9442.554554e+06NaNNaNNaNNaNNaNNaN
10%_saleprice564.7846.304767e+06NaNNaNNaNNaNNaNNaN
1%_salepriceNaNNaN0.0000000.000000e+00NaNNaNNaNNaN
5%_salepriceNaNNaN0.0000262.134350e-08NaNNaNNaNNaN
10%_salepriceNaNNaN0.0000321.417383e-08NaNNaNNaNNaN
1%_salepriceNaNNaNNaNNaN55971.6367681.103367e+09NaNNaN
5%_salepriceNaNNaNNaNNaN58478.2421053.139731e+09NaNNaN
10%_salepriceNaNNaNNaNNaN61028.7099113.846675e+09NaNNaN
1%_saleprice_scaledNaNNaNNaNNaNNaNNaN0.0000000.000000
5%_saleprice_scaledNaNNaNNaNNaNNaNNaN0.0089360.001404
10%_saleprice_scaledNaNNaNNaNNaNNaNNaN0.0074920.000443
\n", - "
" - ], - "text/plain": [ - " diff. list Mean(KNN) diff. list Var.(KNN) \\\n", - "1%_saleprice 170.000 4.240000e+04 \n", - "5%_saleprice 444.944 2.554554e+06 \n", - "10%_saleprice 564.784 6.304767e+06 \n", - "1%_saleprice NaN NaN \n", - "5%_saleprice NaN NaN \n", - "10%_saleprice NaN NaN \n", - "1%_saleprice NaN NaN \n", - "5%_saleprice NaN NaN \n", - "10%_saleprice NaN NaN \n", - "1%_saleprice_scaled NaN NaN \n", - "5%_saleprice_scaled NaN NaN \n", - "10%_saleprice_scaled NaN NaN \n", - "\n", - " diff. list Mean(KNN) scaled \\\n", - "1%_saleprice NaN \n", - "5%_saleprice NaN \n", - "10%_saleprice NaN \n", - "1%_saleprice 0.000000 \n", - "5%_saleprice 0.000026 \n", - "10%_saleprice 0.000032 \n", - "1%_saleprice NaN \n", - "5%_saleprice NaN \n", - "10%_saleprice NaN \n", - "1%_saleprice_scaled NaN \n", - "5%_saleprice_scaled NaN \n", - "10%_saleprice_scaled NaN \n", - "\n", - " diff. list Var.(KNN) scaled diff. list Mean(MI) \\\n", - "1%_saleprice NaN NaN \n", - "5%_saleprice NaN NaN \n", - "10%_saleprice NaN NaN \n", - "1%_saleprice 0.000000e+00 NaN \n", - "5%_saleprice 2.134350e-08 NaN \n", - "10%_saleprice 1.417383e-08 NaN \n", - "1%_saleprice NaN 55971.636768 \n", - "5%_saleprice NaN 58478.242105 \n", - "10%_saleprice NaN 61028.709911 \n", - "1%_saleprice_scaled NaN NaN \n", - "5%_saleprice_scaled NaN NaN \n", - "10%_saleprice_scaled NaN NaN \n", - "\n", - " diff. list Var.(MI) diff. list Mean(MI) scaled \\\n", - "1%_saleprice NaN NaN \n", - "5%_saleprice NaN NaN \n", - "10%_saleprice NaN NaN \n", - "1%_saleprice NaN NaN \n", - "5%_saleprice NaN NaN \n", - "10%_saleprice NaN NaN \n", - "1%_saleprice 1.103367e+09 NaN \n", - "5%_saleprice 3.139731e+09 NaN \n", - "10%_saleprice 3.846675e+09 NaN \n", - "1%_saleprice_scaled NaN 0.000000 \n", - "5%_saleprice_scaled NaN 0.008936 \n", - "10%_saleprice_scaled NaN 0.007492 \n", - "\n", - " diff. 
list Var.(MI) scaled \n", - "1%_saleprice NaN \n", - "5%_saleprice NaN \n", - "10%_saleprice NaN \n", - "1%_saleprice NaN \n", - "5%_saleprice NaN \n", - "10%_saleprice NaN \n", - "1%_saleprice NaN \n", - "5%_saleprice NaN \n", - "10%_saleprice NaN \n", - "1%_saleprice_scaled 0.000000 \n", - "5%_saleprice_scaled 0.001404 \n", - "10%_saleprice_scaled 0.000443 " - ] - }, - "execution_count": 421, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "results1" - ] - }, - { - "cell_type": "code", - "execution_count": 422, - "id": "a255c5bc-c062-4029-8f18-0c7644ca1d7c", - "metadata": {}, - "outputs": [], - "source": [ - "results1.to_csv('housing_data_saleprice_KNN_Mean_results.csv')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c9b0060e-129c-465e-a2a5-c3113ac4b936", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "pytorch_kz_env", - "language": "python", - "name": "pytorch_kz_env" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.11" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} From ab3b01a4dadad4dcec559b3cb2458adde6a09b50 Mon Sep 17 00:00:00 2001 From: Shesh Narayan Gupta <91396937+SheshNGupta@users.noreply.github.com> Date: Tue, 7 Jun 2022 21:30:00 -0400 Subject: [PATCH 7/8] Not required anymore --- .../random_numbers_1000.csv | 1001 ----------------- 1 file changed, 1001 deletions(-) delete mode 100644 notebooks/Imputation_best_practices/random_numbers_1000.csv diff --git a/notebooks/Imputation_best_practices/random_numbers_1000.csv b/notebooks/Imputation_best_practices/random_numbers_1000.csv deleted file mode 100644 index b988bad..0000000 --- a/notebooks/Imputation_best_practices/random_numbers_1000.csv +++ /dev/null @@ -1,1001 +0,0 @@ -,number 
-0,0.14461602473455892 -1,0.07751503129173953 -2,0.15593297226701996 -3,0.09720879582042008 -4,0.32375017402684214 -5,0.686823745565341 -6,0.7068035159437503 -7,0.9167216890721541 -8,0.6352048775376901 -9,0.17132904054220055 -10,0.8159661332230377 -11,0.16475992352396795 -12,0.0409370627667629 -13,0.16726651783050572 -14,0.9709841404608549 -15,0.7314646963631376 -16,0.3426860074270154 -17,0.03452867763070577 -18,0.3574832521777054 -19,0.5745017180628896 -20,0.9464018964648249 -21,0.17346442317598176 -22,0.7981877585797893 -23,0.7809787573425518 -24,0.5238193208352585 -25,0.7821735568253659 -26,0.9934482007890996 -27,0.4184423331593896 -28,0.2599014381523176 -29,0.79832254805514 -30,0.6041862665264831 -31,0.3819864440431342 -32,0.8521701748665009 -33,0.3126469510739037 -34,0.573165703657289 -35,0.6265563684951247 -36,0.739416657331853 -37,0.012060677103418738 -38,0.9526287180476393 -39,0.3919187115227588 -40,0.2638910529614693 -41,0.28055121530104343 -42,0.5573435702875359 -43,0.810470016341365 -44,0.5595615325523974 -45,0.408760756112558 -46,0.8630495060594643 -47,0.8614542990838314 -48,0.8236790421079785 -49,0.445107982060686 -50,0.9240480240430241 -51,0.17212099430841699 -52,0.2821871607285322 -53,0.37501938886942654 -54,0.4401439635045862 -55,0.1316322082815632 -56,0.06144638522796442 -57,0.9719025725097523 -58,0.6437628611013991 -59,0.18965508288943556 -60,0.06647339880458658 -61,0.9432875072199843 -62,0.9635593500723799 -63,0.8159138106628153 -64,0.5268141359426226 -65,0.8097577290919002 -66,0.10832871122562193 -67,0.513926863373751 -68,0.5574679474011387 -69,0.23117155673924017 -70,0.7988683863124257 -71,0.14232155967666804 -72,0.4114506075996932 -73,0.028703811806714996 -74,0.15511224785648736 -75,0.5179635133770123 -76,0.6343922699321491 -77,0.5442703351502044 -78,0.2051777299642784 -79,0.9514959457303863 -80,0.8616963431169906 -81,0.9260797192939593 -82,0.6837050092238902 -83,0.6341651538285088 -84,0.47009701258761005 -85,0.6290009641921982 
-86,0.9976095248457479 -87,0.6766165875739423 -88,0.34775785853790764 -89,0.24721164403263118 -90,0.7644613432516099 -91,0.8578411267105046 -92,0.02847593788616165 -93,0.7352417508308864 -94,0.6439666934556955 -95,0.4145386388213331 -96,0.9000774058908544 -97,0.20984159212668807 -98,0.5736834527493817 -99,0.5731814122745401 -100,0.39175113064248857 -101,0.9414042225202869 -102,0.35865018640717594 -103,0.34942147114579614 -104,0.6287322577319368 -105,0.5640558939154473 -106,0.9935619072485498 -107,0.3230972874260011 -108,0.30050448033239197 -109,0.8535359869169682 -110,0.8186071691655027 -111,0.8507126794809163 -112,0.11848293702439716 -113,0.34039997170201786 -114,0.24848934681272938 -115,0.8713564278618446 -116,0.7192981378269337 -117,0.5612771185476495 -118,0.3001718489057721 -119,0.5582566234063182 -120,0.20715922789136187 -121,0.24718349962906172 -122,0.9096809353144786 -123,0.9496126251594162 -124,0.19298962232482253 -125,0.6823143045816399 -126,0.2950869303839806 -127,0.700872866143569 -128,0.9246255564110638 -129,0.3918411220739513 -130,0.5046695500081352 -131,0.40242035593564884 -132,0.5348070625842399 -133,0.6190144238291141 -134,0.6527067332418969 -135,0.7798534811708006 -136,0.8371435153002993 -137,0.7256654504898371 -138,0.19486710733751433 -139,0.17061227388763445 -140,0.3866266766943538 -141,0.9861342050546121 -142,0.12499832976125236 -143,0.4076100289319884 -144,0.24405060656519562 -145,0.24658924623282708 -146,0.31303910086742404 -147,0.13582549628998997 -148,0.4267352707490074 -149,0.6860815270131422 -150,0.2632104445655937 -151,0.7095899448677616 -152,0.30697391312148903 -153,0.15020764760355143 -154,0.33008237434926957 -155,0.24730791798017127 -156,0.7732146302465086 -157,0.3986960975344779 -158,0.878302550945857 -159,0.3073561016445441 -160,0.21123045619113257 -161,0.5806664509148879 -162,0.8984369263318096 -163,0.8363942698985983 -164,0.2812623945509036 -165,0.10724622968453401 -166,0.5703943012638906 -167,0.7309007201275504 
-168,0.6865969394598082 -169,0.17355862259884247 -170,0.41747139600619776 -171,0.8046329781439144 -172,0.29734663284924356 -173,0.6874907011989809 -174,0.27926268019676004 -175,0.16857167772740067 -176,0.808320103826969 -177,0.22397888146185907 -178,0.4961137292567884 -179,0.39791460648438426 -180,0.749624236829485 -181,0.8166672255804612 -182,0.5416591595071085 -183,0.7784968348980786 -184,0.5246274130247313 -185,0.6165788811775392 -186,0.18993747860389354 -187,0.4375903866391334 -188,0.8977799452863308 -189,0.8974808404906014 -190,0.7833353163003136 -191,0.5735505446147654 -192,0.8592478266591742 -193,0.555628191461239 -194,0.29218190018690193 -195,0.6823254024415241 -196,0.7253556992028032 -197,0.6348979373592366 -198,0.738955355288769 -199,0.40548956817360793 -200,0.9965074549246696 -201,0.6680475408833246 -202,0.4753087000915296 -203,0.8154531729554498 -204,0.39674071637462927 -205,0.3465424212251109 -206,0.3010873336265142 -207,0.3453059844140016 -208,0.3376649450698975 -209,0.4520568726021712 -210,0.7102711170123417 -211,0.5676304992868505 -212,0.246451823758292 -213,0.3045971494873321 -214,0.9191799326806603 -215,0.09062317707388845 -216,0.6456030768852257 -217,0.8145182625891805 -218,0.3502989381872097 -219,0.5454669053640021 -220,0.9229510982790893 -221,0.5017605011244138 -222,0.5814298938642755 -223,0.212077064497179 -224,0.9084673048697015 -225,0.8420689009087419 -226,0.09544595716628035 -227,0.5428219386076877 -228,0.334040059452826 -229,0.5883742904617911 -230,0.6681527250828868 -231,0.920066967991107 -232,0.6980014815164323 -233,0.5140583511099508 -234,0.574062901794968 -235,0.8671650796521554 -236,0.29309281744572635 -237,0.6255644089859125 -238,0.41377688075614283 -239,0.6541722779053092 -240,0.7022455597573617 -241,0.7027961835253476 -242,0.32866027307469425 -243,0.9438823677034145 -244,0.6392304917718383 -245,0.35610068008813955 -246,0.5109988272940061 -247,0.7549785046509206 -248,0.911498498846909 -249,0.7269132750864981 -250,0.43346849143235944 
-251,0.9613052659398792 -252,0.06410207161162618 -253,0.7224542800953787 -254,0.8605028822342475 -255,0.9379303538857604 -256,0.11890111097053702 -257,0.06560232272410749 -258,0.9815175258058294 -259,0.5816233574934034 -260,0.3223771211316614 -261,0.010794999021216611 -262,0.48232848210912416 -263,0.6888091652734284 -264,0.7510123953710294 -265,0.3931342633771988 -266,0.4285185942589612 -267,0.028804295777431044 -268,0.7471054611787746 -269,0.5188475627728396 -270,0.3699806335289325 -271,0.6733240981418717 -272,0.455659972278607 -273,0.8865920570538507 -274,0.9773310825483524 -275,0.9114683092627319 -276,0.7234740743957591 -277,0.47378640650570536 -278,0.9044322182580692 -279,0.6490971485609244 -280,0.9325706015784121 -281,0.15806103989245135 -282,0.20431604755502109 -283,0.9516960107212825 -284,0.17933034496530176 -285,0.10632943259433447 -286,0.20529052976827733 -287,0.26644977396966907 -288,0.990842732357776 -289,0.6626056375310618 -290,0.8934023242009224 -291,0.6087787761836707 -292,0.6622123753279109 -293,0.2795715500728444 -294,0.7356211918792761 -295,0.023450952083761578 -296,0.29930766895885463 -297,0.9605253146799532 -298,0.4773205356946918 -299,0.896685482640458 -300,0.20788119046629716 -301,0.21907107928738412 -302,0.3417751133430835 -303,0.8785812995819484 -304,0.7629857606713326 -305,0.10409839946928867 -306,0.5375122454578438 -307,0.12610808266796247 -308,0.9207106566062669 -309,0.6614470367535862 -310,0.6646296886200127 -311,0.02517423927343887 -312,0.5355435671395777 -313,0.9639505712726043 -314,0.8427700240424094 -315,0.5173256280251634 -316,0.6809361625916177 -317,0.25269387981635383 -318,7.39014254360626e-05 -319,0.6832379417409375 -320,0.3814705574477538 -321,0.2953366513034189 -322,0.8601629667491553 -323,0.4116625534183441 -324,0.20248827761656263 -325,0.0950677170887495 -326,0.37432668808858527 -327,0.5002586204770462 -328,0.5903766299860601 -329,0.4069147751233232 -330,0.46587616114566655 -331,0.20767274566478722 -332,0.4405095567714371 
-333,0.7561490702983013 -334,0.9691510044256642 -335,0.9835349892112961 -336,0.08167974686852508 -337,0.011831197129136273 -338,0.2533369151703784 -339,0.7258386397040382 -340,0.1533224004672512 -341,0.16976063838308353 -342,0.3535761067133554 -343,0.9558080514913609 -344,0.34787269425215606 -345,0.6384858181781367 -346,0.19142808499268715 -347,0.3723886499126876 -348,0.4610104267479409 -349,0.7386414627232165 -350,0.5547224736511918 -351,0.07560627992824742 -352,0.38543929036328295 -353,0.023870001618478964 -354,0.08490118558975879 -355,0.9523181200843006 -356,0.835121255953561 -357,0.8313253101018512 -358,0.4477164423027221 -359,0.427173224834863 -360,0.2607502696316568 -361,0.6518880149684392 -362,0.989596091701078 -363,0.4737188317675711 -364,0.951663574431818 -365,0.6389835611029937 -366,0.4255250760028354 -367,0.36494823219306194 -368,0.10394871793754767 -369,0.08787887115953141 -370,0.05185866702404662 -371,0.5729228447658512 -372,0.3557153056497062 -373,0.14169200930635462 -374,0.6026259214704931 -375,0.6780938325392907 -376,0.0019220493053816456 -377,0.14423401505903843 -378,0.31021740847078616 -379,0.26542859991807166 -380,0.05293698137098246 -381,0.5447383348415423 -382,0.19410883367100906 -383,0.2759766462115508 -384,0.6085305795585376 -385,0.19018564330800136 -386,0.6001023952936514 -387,0.5500869240450543 -388,0.308558554189692 -389,0.613015054522192 -390,0.5053671279653127 -391,0.8033565610860482 -392,0.3190316438196028 -393,0.8430688477494918 -394,0.3907441626865247 -395,0.3749010705929905 -396,0.20374147066354986 -397,0.4445572005828903 -398,0.4325615226381033 -399,0.747347832034453 -400,0.1408237945119577 -401,0.5629196065967164 -402,0.8883715667513505 -403,0.7262344816634011 -404,0.1015240156369166 -405,0.6274596622730756 -406,0.6724938834493908 -407,0.45890555605876826 -408,0.253862163313197 -409,0.20213399227024142 -410,0.9431472444002996 -411,0.4412716272261822 -412,0.6778537756613036 -413,0.5609208700560778 -414,0.7852790417028147 
-415,0.8301487622409094 -416,0.0695242591856422 -417,0.5342345164968271 -418,0.020198821857018268 -419,0.11932836566667071 -420,0.7351542137502673 -421,0.879354084852934 -422,0.060390921051916124 -423,0.3517659280158124 -424,0.25831407832342757 -425,0.25041309629182773 -426,0.6324032934179679 -427,0.6905116746744266 -428,0.038781141504878325 -429,0.11872222658971077 -430,0.3402172182577837 -431,0.1117834948318035 -432,0.8974663997148172 -433,0.7721061886641211 -434,0.467763325594456 -435,0.45960484726135 -436,0.11940893902740168 -437,0.8892320824757846 -438,0.056170722740824464 -439,0.8348974660229447 -440,0.8328276290445746 -441,0.015421942378315512 -442,0.6078039146470725 -443,0.9797170916017848 -444,0.817871594488278 -445,0.4281570072853328 -446,0.9826586617461194 -447,0.5714323337805088 -448,0.5655480118995616 -449,0.13163751508874266 -450,0.5727166298844355 -451,0.3876989055629705 -452,0.24625748760449773 -453,0.062376725489559304 -454,0.1868295868142189 -455,0.07519337399332371 -456,0.8615125038568271 -457,0.0430765434686432 -458,0.7784279481001283 -459,0.1559200654309939 -460,0.28457480300272475 -461,0.4833371043049315 -462,0.21688560355701902 -463,0.051055375260327884 -464,0.8764119752087609 -465,0.03830180552041673 -466,0.899276170682331 -467,0.5326669068942715 -468,0.7966592760107886 -469,0.5977938689767619 -470,0.35735055753216216 -471,0.7502306585594846 -472,0.27262195939610845 -473,0.3367003915054816 -474,0.3718378858875636 -475,0.7252726856566986 -476,0.6108078470654391 -477,0.160140124957443 -478,0.640641195165919 -479,0.819043970313203 -480,0.9460930077740923 -481,0.3955113176387407 -482,0.08228064172201954 -483,0.5692148152461914 -484,0.9379027430417781 -485,0.7262721958954546 -486,0.9974714724600596 -487,0.9816411645054782 -488,0.02801478549452141 -489,0.35876394018958924 -490,0.46224300725504386 -491,0.07977812492324099 -492,0.7825821331768681 -493,0.7728747320072956 -494,0.18411522733742114 -495,0.9349933626453013 -496,0.3305156463539396 
-497,0.05247324921620988 -498,0.3784435570491954 -499,0.8296025413407634 -500,0.44108727645927825 -501,0.2993358032378495 -502,0.8631126359025391 -503,0.250262827945147 -504,0.09566738091105942 -505,0.7130474946994906 -506,0.2235781443128807 -507,0.7026149405611689 -508,0.7224945548679957 -509,0.6170012611217315 -510,0.20186432914831431 -511,0.7852714452298651 -512,0.8903242744728199 -513,0.1399056906045737 -514,0.17026945833848617 -515,0.514586763470415 -516,0.9736100357614889 -517,0.7746591507784915 -518,0.29437001890274195 -519,0.8027253084378705 -520,0.08386991518130038 -521,0.09136100092018629 -522,0.8983567502463687 -523,0.8868693311046169 -524,0.533466309836137 -525,0.42900189716927073 -526,0.1821870276409372 -527,0.4315150943786541 -528,0.47383956070476785 -529,0.42647315825719867 -530,0.20889106515275513 -531,0.15615589390655582 -532,0.7683598815481214 -533,0.8407774935346721 -534,0.4599058924434972 -535,0.20858605861422153 -536,0.25419023941340724 -537,0.03537597137641857 -538,0.5037011171417803 -539,0.319855948227728 -540,0.6143932185624659 -541,0.11338109816795006 -542,0.6071773224023549 -543,0.6320103598568474 -544,0.17739418618305125 -545,0.9193076779462215 -546,0.539317629461803 -547,0.361121293498606 -548,0.8225521587592494 -549,0.037067189096233966 -550,0.7644376889628157 -551,0.9614375433647248 -552,0.26247829558958613 -553,0.04497704041286332 -554,0.49347237237561237 -555,0.10135820428850206 -556,0.9054759324635467 -557,0.3912479745377101 -558,0.16984308812935767 -559,0.3130327921420567 -560,0.2845393861009978 -561,0.7216547111114262 -562,0.6129838442158642 -563,0.6128072542663652 -564,0.5153838338789999 -565,0.7131085367862817 -566,0.8713477772442941 -567,0.9419360672901563 -568,0.9061770339937525 -569,0.9973713503589123 -570,0.6511737928834931 -571,0.0980714039543844 -572,0.12371358453480508 -573,0.5817580949438432 -574,0.3878197750090975 -575,0.3836838844640248 -576,0.3330772932400339 -577,0.8937920239990277 -578,0.42660379831271933 
-579,0.09749777821209016 -580,0.03273234283716975 -581,0.5822939987582022 -582,0.2818759219290342 -583,0.9973773382690185 -584,0.3485811650096795 -585,0.38385951065171464 -586,0.14314846321555819 -587,0.41168484188278187 -588,0.5560325831949468 -589,0.6786651527115524 -590,0.27941662328630534 -591,0.12758615070559087 -592,0.8706880276786881 -593,0.42247163006009736 -594,0.8747921784321767 -595,0.9819789489386005 -596,0.53212913612486 -597,0.6820548577830702 -598,0.14172556124342628 -599,0.8954903213991394 -600,0.8877895505948118 -601,0.2899734461911796 -602,0.39888758518426926 -603,0.5085270928974726 -604,0.5397323464650328 -605,0.5355595876880633 -606,0.6680045600991499 -607,0.07890855054344348 -608,0.36522753036507116 -609,0.7525828516063231 -610,0.8155334605307646 -611,0.948872329161571 -612,0.10085424156574552 -613,0.3063104444859259 -614,0.012248867459916157 -615,0.8332405266792986 -616,0.4477328006875678 -617,0.7381760858313725 -618,0.5381307278002123 -619,0.64442652761133 -620,0.407653279216153 -621,0.988120343671508 -622,0.349242158981631 -623,0.11439639275168989 -624,0.773600974105568 -625,0.3422508667504136 -626,0.35092901992304426 -627,0.6998555631853256 -628,0.5351463864628954 -629,0.6941915466139217 -630,0.27550090759498 -631,0.03955870654832727 -632,0.9737612333749457 -633,0.85659566451438 -634,0.318016024519294 -635,0.07264967870375483 -636,0.6266672136646679 -637,0.5427530067840908 -638,0.08013357115177333 -639,0.27865447324993387 -640,0.8204327600278204 -641,0.6472338718548233 -642,0.8981066937808309 -643,0.9904134149156683 -644,0.7570648348954108 -645,0.04820939759809295 -646,0.49659488586991385 -647,0.2681871451946377 -648,0.05376519761698151 -649,0.1536101940376925 -650,0.2458849441738461 -651,0.19991898782481343 -652,0.49815295225863154 -653,0.7475145062482099 -654,0.5814474904248211 -655,0.9103815228294841 -656,0.8091439841662771 -657,0.044556478634595 -658,0.06582839484468272 -659,0.8723124347377673 -660,0.761407419742959 
-661,0.6295611439582762 -662,0.5602756647971817 -663,0.028833108636930782 -664,0.6925154173449602 -665,0.30781547100300766 -666,0.9456746547718861 -667,0.7733519530494579 -668,0.07325928323474962 -669,0.06051359621130603 -670,0.7684091239449635 -671,0.0772898478864189 -672,0.4652145959688888 -673,0.4373876627767307 -674,0.6267684478070814 -675,0.7183418633741062 -676,0.28256468766217413 -677,0.5073826011665699 -678,0.31820311938601464 -679,0.4089168748142118 -680,0.29885921770184043 -681,0.03372851278925548 -682,0.6703170306185748 -683,0.33198869826189814 -684,0.5975405123566822 -685,0.8211657963714585 -686,0.3461079054656666 -687,0.48616250243415104 -688,0.13447950866733605 -689,0.562667191415577 -690,0.7678216928305848 -691,0.4530052286033409 -692,0.5010228200975811 -693,0.4323309760765164 -694,0.36743023729184987 -695,0.1723991626473217 -696,0.4337302869241262 -697,0.24966845326719822 -698,0.642167289966723 -699,0.616830008851879 -700,0.7703637450499222 -701,0.21386173939654995 -702,0.704115745850898 -703,0.6905967742396926 -704,0.14550064889741277 -705,0.6045853103312959 -706,0.03670533871021342 -707,0.7158949195594291 -708,0.5963326610400751 -709,0.7656919572130952 -710,0.16593604258736716 -711,0.37116447793513807 -712,0.8005826062394383 -713,0.041771054650389106 -714,0.6847846478124059 -715,0.4993883882765534 -716,0.1850707225574446 -717,0.5630874044249621 -718,0.37025234599378876 -719,0.7107125656980158 -720,0.4118677519270143 -721,0.7742568360649871 -722,0.8100159822588088 -723,0.3174629757017041 -724,0.5303493054894146 -725,0.8849961235045513 -726,0.3273403729546115 -727,0.6172150375830504 -728,0.15983060531231819 -729,0.4728594510763161 -730,0.4529506215548965 -731,0.5035430872599636 -732,0.004927231548344402 -733,0.1940383807540148 -734,0.14982458424309364 -735,0.8563549025851751 -736,0.03884058951015723 -737,0.28522238435867453 -738,0.8057900651211597 -739,0.03021709036511122 -740,0.07224489509195386 -741,0.056610587902518716 -742,0.9264467821014194 
-743,0.8138662549320123 -744,0.41783822642927937 -745,0.8723047253359363 -746,0.18136207963463802 -747,0.7164025688996778 -748,0.8196872616954788 -749,0.8068822585021751 -750,0.007129291396152926 -751,0.2602504030386925 -752,0.46370562857123043 -753,0.163784347412389 -754,0.23315134483036648 -755,0.6177440123966893 -756,0.2561521510607473 -757,0.562548076892661 -758,0.5051861935336659 -759,0.13892890236963107 -760,0.004539613445676105 -761,0.17372524036846493 -762,0.6832015932759417 -763,0.8325857535808265 -764,6.826981312790803e-05 -765,0.19612584863473537 -766,0.4145509719106246 -767,0.2619625834737831 -768,0.24549665294458467 -769,0.27612714237335956 -770,0.8531795517703349 -771,0.047146001044882424 -772,0.562788499298586 -773,0.43099863376962144 -774,0.26050958743406505 -775,0.7788002061420074 -776,0.6743332176478016 -777,0.40066992822420555 -778,0.9760876856806906 -779,0.539119034171984 -780,0.18208901259127885 -781,0.12376735142175199 -782,0.9551514655114575 -783,0.7810294736400567 -784,0.9212583468427701 -785,0.8010043139785669 -786,0.22944051406680832 -787,0.050052241727377766 -788,0.6786745563768194 -789,0.429793629888368 -790,0.42563361699182967 -791,0.6784838537337905 -792,0.2858761720399675 -793,0.2890895011305119 -794,0.025121632825633844 -795,0.25765509253553054 -796,0.43572322499776717 -797,0.6647102169428171 -798,0.10847616026636064 -799,0.2537450603718995 -800,0.24416864473064126 -801,0.0672514263787497 -802,0.16935229953659314 -803,0.27439580112524253 -804,0.4284736191801598 -805,0.8586734606964571 -806,0.4315781202007021 -807,0.09915635234890208 -808,0.44899905032025744 -809,0.013316716483281699 -810,0.8391449274551819 -811,0.5061770521104294 -812,0.0672045714638001 -813,0.2933544809181752 -814,0.18022127393582965 -815,0.8781136361676581 -816,0.5157135259800142 -817,0.46243072336418334 -818,0.6222491687600095 -819,0.8889053056935484 -820,0.04571095891205823 -821,0.1513640763692672 -822,0.7774449453314359 -823,0.5183880690457242 
-824,0.2921720252636122 -825,0.09168278609192515 -826,0.39002371887786735 -827,0.3580585061283823 -828,0.12047021435718164 -829,0.6738337221623005 -830,0.21958552211366156 -831,0.5648142473736366 -832,0.23497653874753555 -833,0.16544595712611387 -834,0.040561694693181605 -835,0.7355715205459343 -836,0.9004365787736869 -837,0.5459151013055901 -838,0.7480058346265005 -839,0.7141260383574005 -840,0.1158157631511092 -841,0.9125379342891712 -842,0.3680018768100638 -843,0.7402206231811581 -844,0.2972738079840226 -845,0.8923504613507662 -846,0.5063568640229354 -847,0.24619949696371157 -848,0.5399981903000146 -849,0.7188539530946122 -850,0.648195890336554 -851,0.724518894463568 -852,0.14288147919479144 -853,0.7994514226699949 -854,0.6226355760247099 -855,0.010176035425188967 -856,0.4131692686695717 -857,0.834692399566853 -858,0.49912957372925004 -859,0.00438814293685974 -860,0.3252041908817417 -861,0.534840233118543 -862,0.3587118743837924 -863,0.9677560902733098 -864,0.5973183201684436 -865,0.296691425381007 -866,0.5855079326424412 -867,0.20240300955532187 -868,0.6021550529096645 -869,0.8824421051967469 -870,0.3072946199859422 -871,0.3128979438155097 -872,0.5475105438225643 -873,0.4842448962628426 -874,0.15025538438496855 -875,0.310622456701922 -876,0.6023436011138587 -877,0.5754165898365287 -878,0.6577607923072721 -879,0.7857515237431592 -880,0.22057576301022253 -881,0.8661095076438114 -882,0.910244039608377 -883,0.578456971142587 -884,0.3787935162597653 -885,0.08939098828841929 -886,0.9232626564888574 -887,0.1712490756353049 -888,0.779216672902944 -889,0.3495372334946847 -890,0.47001887737996617 -891,0.29750226759355936 -892,0.2810128485470573 -893,0.2437794575755069 -894,0.2624381305719474 -895,0.8246608579175856 -896,0.6942956761673141 -897,0.11515579868519688 -898,0.1206162339748359 -899,0.26196220525263014 -900,0.5553026135773536 -901,0.40720637901420265 -902,0.9638145298530792 -903,0.4117628415691498 -904,0.31618951259604455 -905,0.11765701103218917 
-906,0.33470652854411564 -907,0.7366235956449027 -908,0.7581529716898141 -909,0.9554767313213507 -910,0.8837680591214232 -911,0.12426303151941864 -912,0.13192594906673982 -913,0.13159583337236658 -914,0.8413301780622977 -915,0.5495370639785346 -916,0.8125566245605387 -917,0.764454058143039 -918,0.9022709587116715 -919,0.22879685531861071 -920,0.49057430203325403 -921,0.4724960647844604 -922,0.8055598260756343 -923,0.7603094118394911 -924,0.3728373302689516 -925,0.3568389711535207 -926,0.4241494594670866 -927,0.7538918294606227 -928,0.5278021541536974 -929,0.4605573424438759 -930,0.6738635250250887 -931,0.16054005910324365 -932,0.8428762894592794 -933,0.9518468101445031 -934,0.32776599980321264 -935,0.3459454626103713 -936,0.08290510118997685 -937,0.4134429089919419 -938,0.7577633137424186 -939,0.4360752405153524 -940,0.977898855124461 -941,0.3899549115493246 -942,0.07360874043480192 -943,0.6234394805204561 -944,0.8281399000229284 -945,0.5936401403938281 -946,0.9444301233719021 -947,0.18311569423561358 -948,0.19900897833219744 -949,0.5859537329420677 -950,0.45369641243149117 -951,0.8140494291811821 -952,0.15504116789135103 -953,0.5097058344234562 -954,0.46015129255339193 -955,0.9168374769143446 -956,0.6646855362668478 -957,0.08710995188842596 -958,0.9648211892689712 -959,0.3099412950871465 -960,0.4182764603873177 -961,0.2811470272374724 -962,0.36150098707209977 -963,0.7547921114548144 -964,0.038441021458981206 -965,0.6114605284345398 -966,0.20333754648264146 -967,0.6879693726518868 -968,0.5615887399000671 -969,0.10931708773465398 -970,0.8275712918793767 -971,0.7747109160797243 -972,0.9005913428689535 -973,0.6399242580079716 -974,0.717434307883715 -975,0.0782758727785875 -976,0.05968847507483932 -977,0.9824576958211914 -978,0.02495988725135534 -979,0.2620968894854523 -980,0.010107863826380292 -981,0.2764875736254404 -982,0.18403412415931986 -983,0.1616789092290818 -984,0.3454521050417132 -985,0.433499552863608 -986,0.040911884966301715 -987,0.20484238883308725 
-988,0.6675520566953549 -989,0.6160709258598361 -990,0.04474552091720452 -991,0.40241951588041347 -992,0.5873473825076658 -993,0.38212818142632543 -994,0.8770948644179681 -995,0.18210726703943658 -996,0.7879879363150989 -997,0.14870738186047538 -998,0.15312132054135852 -999,0.4747372545447177 From 4828ce2c7dccacdde373abe509432d37ac6dfc1f Mon Sep 17 00:00:00 2001 From: Shesh Narayan Gupta <91396937+SheshNGupta@users.noreply.github.com> Date: Tue, 7 Jun 2022 21:30:33 -0400 Subject: [PATCH 8/8] Updated Notebook with review comments incorporated --- .../Imputation_best_practices.ipynb | 4475 +++++++++++++++++ 1 file changed, 4475 insertions(+) create mode 100644 notebooks/Imputation_best_practices/Imputation_best_practices.ipynb diff --git a/notebooks/Imputation_best_practices/Imputation_best_practices.ipynb b/notebooks/Imputation_best_practices/Imputation_best_practices.ipynb new file mode 100644 index 0000000..0ecc6f1 --- /dev/null +++ b/notebooks/Imputation_best_practices/Imputation_best_practices.ipynb @@ -0,0 +1,4475 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "fce74c70-b998-437d-bd77-43d723d57f13", + "metadata": {}, + "source": [ + "# Handling Missing Data\n", + "One of the first steps in any data science workflow is to understand the dataset and to clean it. This is because real-world datasets are often very messy and require significant preprocessing before they can be used for subsequent data science tasks such as feature engineering, model training, etc. One of the tasks within data cleaning is to handle missing data. There are several approaches that can be taken for missing data, such as dropping it, filling with 0's, filling with mean, KNN imputation, etc. In this notebook, we will explore 2 of these imputation techniques, and compare their effectiveness on two sample datasets.\n", + "\n", + "a. 
The first sample dataset is random numbers: we generate 1000 random numbers and perform basic KNN and mean imputation.\n", + "\n", + "b. The second sample dataset is the UCI housing dataset; we apply mean and KNN imputation to both scaled and unscaled values." + ] + }, + { + "cell_type": "markdown", + "id": "e2ceaeb0-e282-4c63-97e2-f1dd03810aa2", + "metadata": {}, + "source": [ + "# What to try in this notebook?\n", + "\n", + "#### 1. Generate a dataset of random numbers, use one column and create missing values (1%, 5%, 10%), scale the values, and apply KNN and mean imputation. Compare the results and compute mean() and var() of the list of differences between the original and imputed values.\n", + "\n", + "\n", + "#### 2. Use a housing dataset from UCI, use one column and create missing values (1%, 5%, 10%), scale the values, and apply KNN and mean imputation. Compare the results and compute mean() and var() of the list of differences between the original and imputed values.\n", + "\n", + "Dataset - https://raw.githubusercontent.com/SheshNGupta/datasets/main/train.csv" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "d8fe4103-6e71-4b97-810c-b599a0482944", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "from sklearn.impute import KNNImputer\n", + "from sklearn.preprocessing import MinMaxScaler" + ] + }, + { + "cell_type": "markdown", + "id": "f95427ef-d6bc-47b8-a516-45a05b238180", + "metadata": {}, + "source": [ + "# 1.1 Random Numbers dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "ae373dd4-26c0-46e8-bdba-dd1d31c77e4e", + "metadata": {}, + "outputs": [], + "source": [ + "random_dataset = pd.DataFrame({'number': np.random.rand(1000)})" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "5ea97930-03cd-48ff-97b9-97e9cd9dde55", + "metadata": {}, + "outputs": [ + { + "data": { 
+ "text/html": [ + "
" + ], + "text/plain": [ + " number\n", + "823 0.925249\n", + "266 0.077479\n", + "959 0.897447\n", + "493 0.259423\n", + "768 0.193178\n", + "105 0.174632\n", + "610 0.456349\n", + "824 0.688290\n", + "968 0.493667\n", + "849 0.368834" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "random_dataset.sample(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "f19e199b-91aa-4e03-9e07-37f5a574d481", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "<class 'pandas.core.frame.DataFrame'>\n", + "RangeIndex: 1000 entries, 0 to 999\n", + "Data columns (total 1 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 number 1000 non-null float64\n", + "dtypes: float64(1)\n", + "memory usage: 7.9 KB\n" + ] + } + ], + "source": [ + "random_dataset.info()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "382f0f03-b3f4-4244-a95c-e78476fae2ca", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "count 1000.000000\n", + "mean 0.494461\n", + "std 0.286876\n", + "min 0.001560\n", + "25% 0.252068\n", + "50% 0.489302\n", + "75% 0.733584\n", + "max 0.999815\n", + "Name: number, dtype: float64" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "random_dataset['number'].describe()" + ] + }, + { + "cell_type": "markdown", + "id": "348a0b85-c450-4d5d-a9d2-c57c95964b42", + "metadata": {}, + "source": [ + "#### Create 3 copies of the number column for 1%, 5% and 10% missing data" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "f5de26b3-17b7-463b-98e4-147a457ca37e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " number number_copy_1_percent number_copy_5_percent \\\n", + "0 0.438564 0.438564 0.438564 \n", + "1 0.836801 0.836801 0.836801 \n", + "2 0.798077 0.798077 0.798077 \n", + "3 0.269161 0.269161 0.269161 \n", + "4 0.830948 0.830948 0.830948 \n", + ".. ... ... ... \n", + "995 0.920130 0.920130 0.920130 \n", + "996 0.007397 0.007397 0.007397 \n", + "997 0.163360 0.163360 0.163360 \n", + "998 0.553700 0.553700 0.553700 \n", + "999 0.771442 0.771442 0.771442 \n", + "\n", + " number_copy_10_percent \n", + "0 0.438564 \n", + "1 0.836801 \n", + "2 0.798077 \n", + "3 0.269161 \n", + "4 0.830948 \n", + ".. ... \n", + "995 0.920130 \n", + "996 0.007397 \n", + "997 0.163360 \n", + "998 0.553700 \n", + "999 0.771442 \n", + "\n", + "[1000 rows x 4 columns]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Copy so that adding columns does not modify a view of random_dataset\n", + "df_number = random_dataset[['number']].copy()\n", + "df_number['number_copy_1_percent'] = df_number['number']\n", + "df_number['number_copy_5_percent'] = df_number['number']\n", + "df_number['number_copy_10_percent'] = df_number['number']\n", + "df_number" + ] + }, + { + "cell_type": "markdown", + "id": "1ff95002-46a0-454b-97c1-6c189153d459", + "metadata": {}, + "source": [ + "#### Check % missing values in this dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "35c38775-26d9-4b1e-97a9-4c46c0d5d92b", + "metadata": {}, + "outputs": [], + "source": [ + "def get_percent_missing(dataframe):\n", + "    # Percentage of NaN values in each column\n", + "    percent_missing = dataframe.isnull().sum() * 100 / len(dataframe)\n", + "    missing_value_df = pd.DataFrame({'column_name': dataframe.columns,\n", + "                                     'percent_missing': percent_missing})\n", + "    return missing_value_df" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "6837b7e5-4444-4914-9c0e-a9cefd2c7b6f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + 
"number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 0.0\n", + "number_copy_5_percent number_copy_5_percent 0.0\n", + "number_copy_10_percent number_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_number))" + ] + }, + { + "cell_type": "markdown", + "id": "25318ebf-b1bf-4f4b-ba1d-011b27a27f39", + "metadata": {}, + "source": [ + "#### Helper function to create missing values" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "76da9076-d9c8-417e-bcfc-8ce7066d1a53", + "metadata": {}, + "outputs": [], + "source": [ + "def create_missing(dataframe, percent, col):\n", + "    # Set a random fraction `percent` of the rows in `col` to NaN (in place)\n", + "    dataframe.loc[dataframe.sample(frac=percent).index, col] = np.nan" + ] + }, + { + "cell_type": "markdown", + "id": "9dc43e57-be39-4efe-8131-d6a3423b8d77", + "metadata": {}, + "source": [ + "#### Create missing data in each column" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "6e8ab693-6043-4ade-b62a-9b3fc9ebf735", + "metadata": {}, + "outputs": [], + "source": [ + "create_missing(df_number, 0.01, 'number_copy_1_percent')\n", + "create_missing(df_number, 0.05, 'number_copy_5_percent')\n", + "create_missing(df_number, 0.1, 'number_copy_10_percent')" + ] + }, + { + "cell_type": "markdown", + "id": "655cb92a-6b63-4498-9c31-d63f11145569", + "metadata": {}, + "source": [ + "#### Check % missing after removing data" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "412518b5-67ec-4a5a-9720-4a0ce7657d44", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 1.0\n", + "number_copy_5_percent number_copy_5_percent 5.0\n", + "number_copy_10_percent number_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_number))" + ] + }, + { + "cell_type": "markdown", + "id": "6876e3fc-b878-4560-a3a4-72c36f2a422e", + "metadata": {}, + "source": [ + "#### 
Store the indices of missing rows" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "c1860270-add6-4963-9aef-27ef1e171fca", + "metadata": {}, + "outputs": [], + "source": [ + "# Store the positional indices of the NaN values in each column\n", + "number_1_idx = list(np.where(df_number['number_copy_1_percent'].isna())[0])\n", + "number_5_idx = list(np.where(df_number['number_copy_5_percent'].isna())[0])\n", + "number_10_idx = list(np.where(df_number['number_copy_10_percent'].isna())[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "57841da6-b453-40cc-8ecc-702fe4613a74", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Length of number_1_idx is 10 and it contains 1.0% of total data in column | Total rows: 1000\n", + "Length of number_5_idx is 50 and it contains 5.0% of total data in column | Total rows: 1000\n", + "Length of number_10_idx is 100 and it contains 10.0% of total data in column | Total rows: 1000\n" + ] + } + ], + "source": [ + "print(f\"Length of number_1_idx is {len(number_1_idx)} and it contains {(len(number_1_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")\n", + "print(f\"Length of number_5_idx is {len(number_5_idx)} and it contains {(len(number_5_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")\n", + "print(f\"Length of number_10_idx is {len(number_10_idx)} and it contains {(len(number_10_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")" + ] + }, + { + "cell_type": "markdown", + "id": "93450753-9080-4b17-b785-76acd5f9e19f", + "metadata": {}, + "source": [ + "## What is KNN imputation?\n", + "KNN imputation identifies the nearest neighbors of an observation through a measure of distance and estimates its missing values from the observed values of those neighboring observations." + ] + }, + { + "cell_type": "markdown", + "id": "47469d0b-a8f3-4469-b18c-3a457f7dc373", + "metadata": {}, + "source": [ + "### Perform KNN imputation on the df_number dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "b09c6c85-4ce3-4aeb-bb81-6a698494a58e", + "metadata": {}, + "outputs": [], + "source": [ + "# df_number1 still holds the complete 'number' column, so neighbor distances use the true values\n", + "df_number1 = df_number.copy(deep=True)\n", + "imputer = KNNImputer(n_neighbors=5)\n", + "imputed_number_df = pd.DataFrame(imputer.fit_transform(df_number1), columns=df_number1.columns)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "2f051a7d-3ebd-4839-aae0-ef125944d613", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " number number_copy_1_percent number_copy_5_percent \\\n", + "701 0.244629 0.244629 0.244629 \n", + "39 0.517202 0.517202 0.517202 \n", + "335 0.100813 0.100813 0.100813 \n", + "204 0.277534 0.277534 0.277534 \n", + "391 0.859032 0.859032 0.857231 \n", + "203 0.252622 0.252622 0.252622 \n", + "144 0.844587 0.844587 0.844587 \n", + "201 0.431603 0.431603 0.431603 \n", + "749 0.848537 0.848537 0.848537 \n", + "497 0.464531 0.464531 0.464531 \n", + "\n", + " number_copy_10_percent \n", + "701 0.244629 \n", + "39 0.517202 \n", + "335 0.100813 \n", + "204 0.277534 \n", + "391 0.859032 \n", + "203 0.252622 \n", + "144 0.844587 \n", + "201 0.431603 \n", + "749 0.848240 \n", + "497 0.464531 " + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "imputed_number_df.sample(10)" + ] + }, + { + "cell_type": "markdown", + "id": "ddc79a45-bd2b-44f3-a3c4-aaefa73b43d9", + "metadata": {}, + "source": [ + "#### Check the % missing data in the dataframe now" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "5c98d450-bf5a-46e5-9091-c6a1202a2611", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 0.0\n", + "number_copy_5_percent number_copy_5_percent 0.0\n", + "number_copy_10_percent number_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(imputed_number_df))" + ] + }, + { + "cell_type": "markdown", + "id": "f14476bf-29e6-4d9a-9cd4-9dd56a53b466", + "metadata": {}, + "source": [ + "#### Store the list of differences between the original and imputed values" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "3f096800-dc6e-4455-a9e6-2db18884e5ee", + "metadata": {}, + "outputs": [], + "source": [ + "# Create lists of absolute differences between the imputed and original values\n", + "\n", + "number_diff_1 = []\n", + "number_diff_5 = []\n", + "number_diff_10 = []\n", + "\n", + "for i in number_1_idx:\n", + "    diff1 = abs(imputed_number_df['number_copy_1_percent'][i] - df_number1['number'][i])\n", + "    number_diff_1.append(diff1)\n", + "\n", + "for i in number_5_idx:\n", + "    diff5 = abs(imputed_number_df['number_copy_5_percent'][i] - df_number1['number'][i])\n", + "    number_diff_5.append(diff5)\n", + "\n", + "for i in number_10_idx:\n", + "    diff10 = abs(imputed_number_df['number_copy_10_percent'][i] - df_number1['number'][i])\n", + "    number_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "4a2c29fc-99f3-4624-808e-437d3983cabb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(number_diff_1))\n", + "print(len(number_diff_5))\n", + "print(len(number_diff_10))" + ] + }, + { + "cell_type": "markdown", + "id": "4ec4adbe-5571-40e3-90ba-92cb431161ca", + "metadata": {}, + "source": [ + "### Calculate the mean and variance of the list of differences (KNN)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "1163cb62-9dc4-427e-b5cf-20bf3e16d79b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 0.0005846547543839273 and variance 1% is 2.970798404420463e-07\n", + "The mean of 5% is 0.000757031064033434 and variance 5% is 4.329913201182178e-07\n", + "The mean of 10% is 0.000757031064033434 and variance 10% is 4.0351965946805086e-07\n" + ] + } + ], + "source": [ + "m1 = sum(number_diff_1) / len(number_diff_1)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res1 = sum((xi - m1) ** 2 for xi in number_diff_1) / len(number_diff_1)\n", + "\n", + "m5 = sum(number_diff_5) / len(number_diff_5)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res5 = sum((xii - m5) ** 2 for xii in number_diff_5) / len(number_diff_5)\n", + "\n", + "\n", + "m10 = sum(number_diff_10) / len(number_diff_10)\n", + "\n", + "# calculate the population variance using a generator expression\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in number_diff_10) / len(number_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "6987d059-7449-44a0-a3c2-8605362a18a0", + "metadata": {}, + "outputs": [], + "source": [ + "df_knn_number = pd.DataFrame.from_dict({'1%_number': [m1, var_res1],\n", + "                                        '5%_number': [m5, var_res5],\n", + "                                        '10%_number': [m10, var_res10]}, orient='index')\n", + "df_knn_number.columns=['diff. list Mean(KNN)', 'diff. list Var.(KNN)']" + ] + }, + { + "cell_type": "markdown", + "id": "8d1efbf1-61d6-43e1-9a4a-4af137e081c9", + "metadata": {}, + "source": [ + "## What is Mean imputation?\n", + "Mean imputation (MI) is a method in which the mean of the observed values for each variable is computed and the missing values for that variable are imputed by this mean." 
+ ] + }, + { + "cell_type": "markdown", + "id": "41740e20-5dae-403e-a83b-94c91469fcc3", + "metadata": {}, + "source": [ + "### Perform MEAN based imputation" + ] + }, + { + "cell_type": "markdown", + "id": "17b69478-e97c-41b9-828a-eefbb46eb161", + "metadata": {}, + "source": [ + "#### Before mean imputation % missing" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "5a828216-8f1a-4157-8141-77e6c929f57a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 1.0\n", + "number_copy_5_percent number_copy_5_percent 5.0\n", + "number_copy_10_percent number_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "df_number2 = df_number.copy(deep=True)\n", + "print(get_percent_missing(df_number2))" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "1e137676-9f01-44b9-8a84-50d03a89436b", + "metadata": {}, + "outputs": [], + "source": [ + "df_number2['number_copy_1_percent'] = df_number2['number_copy_1_percent'].fillna(df_number2['number_copy_1_percent'].mean())\n", + "df_number2['number_copy_5_percent'] = df_number2['number_copy_5_percent'].fillna(df_number2['number_copy_5_percent'].mean())\n", + "df_number2['number_copy_10_percent'] = df_number2['number_copy_10_percent'].fillna(df_number2['number_copy_10_percent'].mean())" + ] + }, + { + "cell_type": "markdown", + "id": "8da82021-d96a-46ac-81df-035977cb5497", + "metadata": {}, + "source": [ + "#### After mean impute % missing " + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "669c14bd-f920-47db-8476-1cd1b4f4f5bb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 0.0\n", + "number_copy_5_percent number_copy_5_percent 0.0\n", + "number_copy_10_percent 
number_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_number2))" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "ccb60d18-b24e-4211-9947-46ee0bcc06fe", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", +
" <thead><tr><th></th><th>number</th><th>number_copy_1_percent</th><th>number_copy_5_percent</th><th>number_copy_10_percent</th></tr></thead>\n", +
" <tbody>\n", +
" <tr><th>293</th><td>0.583231</td><td>0.583231</td><td>0.583231</td><td>0.583231</td></tr>\n", +
" <tr><th>461</th><td>0.867035</td><td>0.867035</td><td>0.867035</td><td>0.867035</td></tr>\n", +
" <tr><th>875</th><td>0.676228</td><td>0.676228</td><td>0.676228</td><td>0.676228</td></tr>\n", +
" <tr><th>999</th><td>0.771442</td><td>0.771442</td><td>0.771442</td><td>0.771442</td></tr>\n", +
" <tr><th>75</th><td>0.909050</td><td>0.909050</td><td>0.909050</td><td>0.909050</td></tr>\n", +
" <tr><th>98</th><td>0.629583</td><td>0.629583</td><td>0.629583</td><td>0.629583</td></tr>\n", +
" <tr><th>381</th><td>0.181614</td><td>0.181614</td><td>0.181614</td><td>0.181614</td></tr>\n", +
" <tr><th>592</th><td>0.523109</td><td>0.523109</td><td>0.523109</td><td>0.523109</td></tr>\n", +
" <tr><th>155</th><td>0.038074</td><td>0.038074</td><td>0.038074</td><td>0.038074</td></tr>\n", +
" <tr><th>630</th><td>0.869200</td><td>0.869200</td><td>0.869200</td><td>0.869200</td></tr>\n", +
" </tbody>\n", +
"</table>" + ], + "text/plain": [ + " number number_copy_1_percent number_copy_5_percent \\\n", + "293 0.583231 0.583231 0.583231 \n", + "461 0.867035 0.867035 0.867035 \n", + "875 0.676228 0.676228 0.676228 \n", + "999 0.771442 0.771442 0.771442 \n", + "75 0.909050 0.909050 0.909050 \n", + "98 0.629583 0.629583 0.629583 \n", + "381 0.181614 0.181614 0.181614 \n", + "592 0.523109 0.523109 0.523109 \n", + "155 0.038074 0.038074 0.038074 \n", + "630 0.869200 0.869200 0.869200 \n", + "\n", + " number_copy_10_percent \n", + "293 0.583231 \n", + "461 0.867035 \n", + "875 0.676228 \n", + "999 0.771442 \n", + "75 0.909050 \n", + "98 0.629583 \n", + "381 0.181614 \n", + "592 0.523109 \n", + "155 0.038074 \n", + "630 0.869200 " + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_number2.sample(10)" + ] + }, + { + "cell_type": "markdown", + "id": "88d89795-0ae9-4f37-89cd-b24d36658588", + "metadata": {}, + "source": [ + "#### Create a list of differences - MEAN" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "530979d5-52c4-473d-95f3-754c460a7ab6", + "metadata": {}, + "outputs": [], + "source": [ + "# create list of differences between imputed and original values\n", + "\n", + "number_diff_1_mean = []\n", + "number_diff_5_mean = []\n", + "number_diff_10_mean = []\n", + "count = 0\n", + "\n", + "for i in number_1_idx:\n", + " count +=1\n", + " diff1 = abs(df_number2['number_copy_1_percent'][i] - df_number2['number'][i])\n", + " number_diff_1_mean.append(diff1)\n", + " \n", + "\n", + "for i in number_5_idx:\n", + " diff5 = abs(df_number2['number_copy_5_percent'][i] - df_number2['number'][i])\n", + " number_diff_5_mean.append(diff5)\n", + "\n", + "for i in number_10_idx:\n", + " diff10 = abs(df_number2['number_copy_10_percent'][i] - df_number2['number'][i])\n", + " number_diff_10_mean.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "28dd2494-0175-431e-b4b7-09ee4af1f6a0", +
"metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(number_diff_1_mean))\n", + "print(len(number_diff_5_mean))\n", + "print(len(number_diff_10_mean))" + ] + }, + { + "cell_type": "markdown", + "id": "4e90251e-4c0a-4e2d-82b1-8764374aed1c", + "metadata": {}, + "source": [ + "### Calculate the mean and variance of the list of differences - MEAN Impute" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "682bd76e-4875-4b4d-b90b-91d8a6e492ae", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 0.29595595666774266 and varience 1% is 0.02234691636534702\n", + "The mean of 5% is 0.2606794287327926 and varience 5% is 0.017948559982927326\n", + "The mean of 10% is 0.2606794287327926 and varience 10% is 0.019225304317791198\n" + ] + } + ], + "source": [ + "m1 = sum(number_diff_1_mean) / len(number_diff_1_mean)\n", + "\n", + "# calculate variance using a list comprehension\n", + "var_res1 = sum((xi - m1) ** 2 for xi in number_diff_1_mean) / len(number_diff_1_mean)\n", + "\n", + "m5 = sum(number_diff_5_mean) / len(number_diff_5_mean)\n", + "\n", + "# calculate variance using a list comprehension\n", + "var_res5 = sum((xii - m5) ** 2 for xii in number_diff_5_mean) / len(number_diff_5_mean)\n", + "\n", + "\n", + "m10 = sum(number_diff_10_mean) / len(number_diff_10_mean)\n", + "\n", + "# calculate variance using a list comprehension\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in number_diff_10_mean) / len(number_diff_10_mean)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "1f41880d-3e7d-48c9-8744-7e47ccae3c17", + "metadata": {}, + "outputs": 
[], + "source": [ + "df_MI_number = pd.DataFrame.from_dict({'1%_number': [m1, var_res1],\n", + " '5%_number': [m5, var_res5],\n", + " '10%_number': [m10, var_res10]}, orient='index')\n", + "df_MI_number.columns=['diff. list Mean(MI)', 'diff. list Var.(MI)']" + ] + }, + { + "cell_type": "markdown", + "id": "ec64b079-db97-429c-ae3a-519eec91db3f", + "metadata": {}, + "source": [ + "## KNN and MEAN columns side by side" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "d74b0e73-e3f0-4107-806d-c5d5a50aab9a", + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import display_html\n", + "from itertools import chain,cycle\n", + "def display_side_by_side(*args,titles=cycle([''])):\n", + " html_str=''\n", + " for df,title in zip(args, chain(titles,cycle(['</br>'])) ):\n", + " html_str+='<th style=\"text-align:center\"><td style=\"vertical-align:top\">'\n", + " html_str+=f'<h2 style=\"text-align: center;\">{title}</h2>'\n", + " html_str+=df.to_html().replace('table','table style=\"display:inline\"')\n", + " html_str+='</td></th>'\n", + " display_html(html_str,raw=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "747a487f-cbc4-467a-9bc7-b0856dbb6576", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from IPython.display import display, HTML\n", + "\n", + "CSS = \"\"\"\n", + ".output {\n", + " flex-direction: row;\n", + "}\n", + "\"\"\"\n", + "\n", + "HTML('<style>{}</style>'.format(CSS))" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "d24551d1-cd58-4a41-8262-873fe5034272", + "metadata": {}, + "outputs": [], + "source": [ + "# https://github.com/epmoyer/ipy_table/issues/24\n", + "\n", + "from IPython.core.display import HTML\n", + "\n", + "def multi_table(table_list):\n", + " ''' Accepts a list of IpyTable objects and returns a table which contains each IpyTable in a cell\n", + " '''\n", + " return HTML(\n", + " '<table><tr style=\"background-color:white;\">' + \n", + " ''.join(['<td>' + table._repr_html_() + '</td>' for table in table_list]) +\n", + " '</tr></table>'\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "8a8daa30-3abf-4315-ae58-f9171ff000d5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[103, 272, 302, 441, 542]\n" + ] + } + ], + "source": [ + "print(number_1_idx[:5])" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "da6b1646-2417-42b7-bc8f-d3b0be85c61b", + "metadata": {}, + "outputs": [], + "source": [ + "compare_1 = imputed_number_df.loc[:, [\"number\", \"number_copy_1_percent\"]]\n", + "compare_5 = imputed_number_df.loc[:, [\"number\", \"number_copy_5_percent\"]]\n", + "compare_10 = imputed_number_df.loc[:, [\"number\", \"number_copy_10_percent\"]]" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "380b94cf-264f-4a41-bb1d-ac272354073f", + "metadata": {}, + "outputs": [], + "source": [ + "compare_1_df = compare_1.iloc[number_1_idx]\n", + "compare_5_df = compare_5.iloc[number_5_idx]\n", + "compare_10_df = compare_10.iloc[number_10_idx]" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "e5b21e71-0ddd-4c60-b931-b384d65230dd", + "metadata": {}, + "outputs": [], + "source": [ + "compare_1_mean = df_number2.loc[:, [\"number\", \"number_copy_1_percent\"]]\n", + "compare_5_mean = df_number2.loc[:, [\"number\", \"number_copy_5_percent\"]]\n", + "compare_10_mean = df_number2.loc[:, [\"number\", \"number_copy_10_percent\"]]" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "29be3554-8129-4f0c-bad6-1270b7c6c05b", + "metadata": {}, + "outputs": [], + "source": [ + "compare_1_mean_df = compare_1_mean.iloc[number_1_idx]\n", + "compare_5_mean_df = compare_5_mean.iloc[number_5_idx]\n", + "compare_10_mean_df = compare_10_mean.iloc[number_10_idx]" + ] + }, + { + "cell_type": "markdown", + "id": "72a3bc3c-0f91-49ad-bf03-dc4b7ace265d", + "metadata": {}, + "source": [ + "#### **number 1% KNN Impute VS number 1% Mean Impute**" + ] + }, + { + "cell_type": "code", +
"execution_count": 39, + "id": "6fd11f89-9f4b-49b3-b114-1ab3b461f180", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<table><tr style=\"background-color:white;\"><td>\n", +
"<table border=\"1\" class=\"dataframe\">\n", +
" <thead><tr><th></th><th>number</th><th>number_copy_1_percent</th></tr></thead>\n", +
" <tbody>\n", +
" <tr><th>103</th><td>0.915554</td><td>0.915539</td></tr>\n", +
" <tr><th>272</th><td>0.899497</td><td>0.899795</td></tr>\n", +
" <tr><th>302</th><td>0.091276</td><td>0.090500</td></tr>\n", +
" <tr><th>441</th><td>0.050874</td><td>0.050914</td></tr>\n", +
" <tr><th>542</th><td>0.744744</td><td>0.744208</td></tr>\n", +
" </tbody>\n", +
"</table>\n", +
"</td><td>\n", +
"<table border=\"1\" class=\"dataframe\">\n", +
" <thead><tr><th></th><th>number</th><th>number_copy_1_percent</th></tr></thead>\n", +
" <tbody>\n", +
" <tr><th>103</th><td>0.915554</td><td>0.493992</td></tr>\n", +
" <tr><th>272</th><td>0.899497</td><td>0.493992</td></tr>\n", +
" <tr><th>302</th><td>0.091276</td><td>0.493992</td></tr>\n", +
" <tr><th>441</th><td>0.050874</td><td>0.493992</td></tr>\n", +
" <tr><th>542</th><td>0.744744</td><td>0.493992</td></tr>\n", +
" </tbody>\n", +
"</table>\n", +
"</td></tr></table>
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "multi_table([compare_1_df.head(), compare_1_mean_df.head()])" + ] + }, + { + "cell_type": "markdown", + "id": "e1fc9d1c-53ef-42d3-809b-d68051057e48", + "metadata": {}, + "source": [ + "#### **number 5% KNN Impute VS number 5% Mean Impute**" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "a97c1530-2e50-48d2-a7e0-89fc70f648e5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<table><tr style=\"background-color:white;\"><td>\n", +
"<table border=\"1\" class=\"dataframe\">\n", +
" <thead><tr><th></th><th>number</th><th>number_copy_5_percent</th></tr></thead>\n", +
" <tbody>\n", +
" <tr><th>6</th><td>0.040451</td><td>0.039472</td></tr>\n", +
" <tr><th>14</th><td>0.852026</td><td>0.849692</td></tr>\n", +
" <tr><th>16</th><td>0.213343</td><td>0.212438</td></tr>\n", +
" <tr><th>49</th><td>0.608203</td><td>0.609078</td></tr>\n", +
" <tr><th>64</th><td>0.973574</td><td>0.972234</td></tr>\n", +
" </tbody>\n", +
"</table>\n", +
"</td><td>\n", +
"<table border=\"1\" class=\"dataframe\">\n", +
" <thead><tr><th></th><th>number</th><th>number_copy_5_percent</th></tr></thead>\n", +
" <tbody>\n", +
" <tr><th>6</th><td>0.040451</td><td>0.49266</td></tr>\n", +
" <tr><th>14</th><td>0.852026</td><td>0.49266</td></tr>\n", +
" <tr><th>16</th><td>0.213343</td><td>0.49266</td></tr>\n", +
" <tr><th>49</th><td>0.608203</td><td>0.49266</td></tr>\n", +
" <tr><th>64</th><td>0.973574</td><td>0.49266</td></tr>\n", +
" </tbody>\n", +
"</table>\n", +
"</td></tr></table>
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "multi_table([compare_5_df.head(), compare_5_mean_df.head()])" + ] + }, + { + "cell_type": "markdown", + "id": "1e732ac9-faf7-4457-baef-ac9c4976598c", + "metadata": {}, + "source": [ + "#### **number 10% KNN Impute VS number 10% Mean Impute**" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "f2d22e8f-5a0b-48c0-9150-a391d48e93b2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<table><tr style=\"background-color:white;\"><td>\n", +
"<table border=\"1\" class=\"dataframe\">\n", +
" <thead><tr><th></th><th>number</th><th>number_copy_10_percent</th></tr></thead>\n", +
" <tbody>\n", +
" <tr><th>10</th><td>0.205755</td><td>0.206019</td></tr>\n", +
" <tr><th>16</th><td>0.213343</td><td>0.212724</td></tr>\n", +
" <tr><th>27</th><td>0.738704</td><td>0.737446</td></tr>\n", +
" <tr><th>29</th><td>0.322577</td><td>0.322495</td></tr>\n", +
" <tr><th>43</th><td>0.403866</td><td>0.404988</td></tr>\n", +
" </tbody>\n", +
"</table>\n", +
"</td><td>\n", +
"<table border=\"1\" class=\"dataframe\">\n", +
" <thead><tr><th></th><th>number</th><th>number_copy_10_percent</th></tr></thead>\n", +
" <tbody>\n", +
" <tr><th>10</th><td>0.205755</td><td>0.50025</td></tr>\n", +
" <tr><th>16</th><td>0.213343</td><td>0.50025</td></tr>\n", +
" <tr><th>27</th><td>0.738704</td><td>0.50025</td></tr>\n", +
" <tr><th>29</th><td>0.322577</td><td>0.50025</td></tr>\n", +
" <tr><th>43</th><td>0.403866</td><td>0.50025</td></tr>\n", +
" </tbody>\n", +
"</table>\n", +
"</td></tr></table>
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "multi_table([compare_10_df.head(), compare_10_mean_df.head()])" + ] + }, + { + "cell_type": "markdown", + "id": "cc817314-971f-4abf-a56e-9830a5cf0329", + "metadata": {}, + "source": [ + "# 1.2 Random Numbers dataset Results - KNN and MEAN" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "c4ebb2fe-34e9-4bd2-bf53-9392e5d05e52", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<table><tr style=\"background-color:white;\"><td>\n", +
"<table border=\"1\" class=\"dataframe\">\n", +
" <thead><tr><th></th><th>diff. list Mean(KNN)</th><th>diff. list Var.(KNN)</th></tr></thead>\n", +
" <tbody>\n", +
" <tr><th>1%_number</th><td>0.000585</td><td>2.970798e-07</td></tr>\n", +
" <tr><th>5%_number</th><td>0.000757</td><td>4.329913e-07</td></tr>\n", +
" <tr><th>10%_number</th><td>0.000661</td><td>4.035197e-07</td></tr>\n", +
" </tbody>\n", +
"</table>\n", +
"</td><td>\n", +
"<table border=\"1\" class=\"dataframe\">\n", +
" <thead><tr><th></th><th>diff. list Mean(MI)</th><th>diff. list Var.(MI)</th></tr></thead>\n", +
" <tbody>\n", +
" <tr><th>1%_number</th><td>0.295956</td><td>0.022347</td></tr>\n", +
" <tr><th>5%_number</th><td>0.260679</td><td>0.017949</td></tr>\n", +
" <tr><th>10%_number</th><td>0.242477</td><td>0.019225</td></tr>\n", +
" </tbody>\n", +
"</table>\n", +
"</td></tr></table>" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "multi_table([df_knn_number, df_MI_number])" + ] + }, + { + "cell_type": "markdown", + "id": "177bab9a-d501-479d-bbe8-d0c93926a24d", + "metadata": {}, + "source": [ + "Results: KNN imputation performed far better than mean imputation on this dataset. The mean absolute difference between the original and the KNN-imputed values stays below 0.001 at every missing rate, compared with roughly 0.24 to 0.30 for mean imputation. Note that the imputed dataframe also contains the unmasked `number` column, so the nearest neighbours are most likely rows with almost identical values; the near-zero error reflects this setup and should not be expected when no such closely related column is available." + ] + }, + { + "cell_type": "markdown", + "id": "08586561-e3a5-4d15-a1c0-b8d71731a84a", + "metadata": {}, + "source": [ + "# 2.1 Housing Dataset " + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "c05f4dd5-4cdc-4617-939a-2e22ec859af1", + "metadata": {}, + "outputs": [], + "source": [ + "housing_data = pd.read_csv('https://raw.githubusercontent.com/SheshNGupta/datasets/main/train.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "8564d163-97ce-44da-8d3c-6f8cd9c1d0a1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", +
" <thead><tr><th></th><th>Id</th><th>MSSubClass</th><th>MSZoning</th><th>LotFrontage</th><th>LotArea</th><th>Street</th><th>Alley</th><th>LotShape</th><th>LandContour</th><th>Utilities</th><th>...</th><th>PoolArea</th><th>PoolQC</th><th>Fence</th><th>MiscFeature</th><th>MiscVal</th><th>MoSold</th><th>YrSold</th><th>SaleType</th><th>SaleCondition</th><th>SalePrice</th></tr></thead>\n", +
" <tbody>\n", +
" <tr><th>740</th><td>741</td><td>70</td><td>RM</td><td>60.0</td><td>9600</td><td>Pave</td><td>Grvl</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>GdPrv</td><td>NaN</td><td>0</td><td>5</td><td>2007</td><td>WD</td><td>Abnorml</td><td>132000</td></tr>\n", +
" <tr><th>1209</th><td>1210</td><td>20</td><td>RL</td><td>85.0</td><td>10182</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>5</td><td>2006</td><td>New</td><td>Partial</td><td>290000</td></tr>\n", +
" <tr><th>64</th><td>65</td><td>60</td><td>RL</td><td>NaN</td><td>9375</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>GdPrv</td><td>NaN</td><td>0</td><td>2</td><td>2009</td><td>WD</td><td>Normal</td><td>219500</td></tr>\n", +
" <tr><th>208</th><td>209</td><td>60</td><td>RL</td><td>NaN</td><td>14364</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Low</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>4</td><td>2007</td><td>WD</td><td>Normal</td><td>277000</td></tr>\n", +
" <tr><th>436</th><td>437</td><td>50</td><td>RM</td><td>40.0</td><td>4400</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>10</td><td>2006</td><td>WD</td><td>Normal</td><td>116000</td></tr>\n", +
" <tr><th>19</th><td>20</td><td>20</td><td>RL</td><td>70.0</td><td>7560</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>MnPrv</td><td>NaN</td><td>0</td><td>5</td><td>2009</td><td>COD</td><td>Abnorml</td><td>139000</td></tr>\n", +
" <tr><th>1449</th><td>1450</td><td>180</td><td>RM</td><td>21.0</td><td>1533</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>8</td><td>2006</td><td>WD</td><td>Abnorml</td><td>92000</td></tr>\n", +
" <tr><th>449</th><td>450</td><td>50</td><td>RM</td><td>50.0</td><td>6000</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>6</td><td>2007</td><td>WD</td><td>Normal</td><td>120000</td></tr>\n", +
" <tr><th>1185</th><td>1186</td><td>50</td><td>RL</td><td>60.0</td><td>9738</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>3</td><td>2006</td><td>WD</td><td>Normal</td><td>104900</td></tr>\n", +
" <tr><th>1023</th><td>1024</td><td>120</td><td>RL</td><td>43.0</td><td>3182</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>5</td><td>2008</td><td>WD</td><td>Normal</td><td>191000</td></tr>\n", +
" </tbody>\n", +
"</table>\n", +
"<p>10 rows × 81 columns</p>
" + ], + "text/plain": [ + " Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", + "740 741 70 RM 60.0 9600 Pave Grvl Reg \n", + "1209 1210 20 RL 85.0 10182 Pave NaN IR1 \n", + "64 65 60 RL NaN 9375 Pave NaN Reg \n", + "208 209 60 RL NaN 14364 Pave NaN IR1 \n", + "436 437 50 RM 40.0 4400 Pave NaN Reg \n", + "19 20 20 RL 70.0 7560 Pave NaN Reg \n", + "1449 1450 180 RM 21.0 1533 Pave NaN Reg \n", + "449 450 50 RM 50.0 6000 Pave NaN Reg \n", + "1185 1186 50 RL 60.0 9738 Pave NaN Reg \n", + "1023 1024 120 RL 43.0 3182 Pave NaN Reg \n", + "\n", + " LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal \\\n", + "740 Lvl AllPub ... 0 NaN GdPrv NaN 0 \n", + "1209 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "64 Lvl AllPub ... 0 NaN GdPrv NaN 0 \n", + "208 Low AllPub ... 0 NaN NaN NaN 0 \n", + "436 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "19 Lvl AllPub ... 0 NaN MnPrv NaN 0 \n", + "1449 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "449 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "1185 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "1023 Lvl AllPub ... 
0 NaN NaN NaN 0 \n", + "\n", + " MoSold YrSold SaleType SaleCondition SalePrice \n", + "740 5 2007 WD Abnorml 132000 \n", + "1209 5 2006 New Partial 290000 \n", + "64 2 2009 WD Normal 219500 \n", + "208 4 2007 WD Normal 277000 \n", + "436 10 2006 WD Normal 116000 \n", + "19 5 2009 COD Abnorml 139000 \n", + "1449 8 2006 WD Abnorml 92000 \n", + "449 6 2007 WD Normal 120000 \n", + "1185 3 2006 WD Normal 104900 \n", + "1023 5 2008 WD Normal 191000 \n", + "\n", + "[10 rows x 81 columns]" + ] + }, + "execution_count": 52, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data.sample(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "bd81975c-0a21-414b-8e20-3564d35b9f9b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "663" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data['SalePrice'].nunique()" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "67d1046e-a1ad-412e-a7e8-a0d51729cec7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1073" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data['LotArea'].nunique()" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "id": "64b05e52-72dc-4f7d-aca3-d043036b4d2f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "count 1460.000000\n", + "mean 180921.195890\n", + "std 79442.502883\n", + "min 34900.000000\n", + "25% 129975.000000\n", + "50% 163000.000000\n", + "75% 214000.000000\n", + "max 755000.000000\n", + "Name: SalePrice, dtype: float64" + ] + }, + "execution_count": 55, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data['SalePrice'].describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "id": "b7e9928c-4785-4ee1-8150-cd0fa1ef3325", + "metadata": {}, + "outputs": [ + { + 
"data": { + "text/plain": [ + "count 1460.000000\n", + "mean 10516.828082\n", + "std 9981.264932\n", + "min 1300.000000\n", + "25% 7553.500000\n", + "50% 9478.500000\n", + "75% 11601.500000\n", + "max 215245.000000\n", + "Name: LotArea, dtype: float64" + ] + }, + "execution_count": 56, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data['LotArea'].describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "id": "20149f80-07dc-4eaa-8d0e-7de6612a7dce", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "Id Id 0.000000\n", + "MSSubClass MSSubClass 0.000000\n", + "MSZoning MSZoning 0.000000\n", + "LotFrontage LotFrontage 17.739726\n", + "LotArea LotArea 0.000000\n", + "Street Street 0.000000\n", + "Alley Alley 93.767123\n", + "LotShape LotShape 0.000000\n", + "LandContour LandContour 0.000000\n", + "Utilities Utilities 0.000000\n", + "LotConfig LotConfig 0.000000\n", + "LandSlope LandSlope 0.000000\n", + "Neighborhood Neighborhood 0.000000\n", + "Condition1 Condition1 0.000000\n", + "Condition2 Condition2 0.000000\n", + "BldgType BldgType 0.000000\n", + "HouseStyle HouseStyle 0.000000\n", + "OverallQual OverallQual 0.000000\n", + "OverallCond OverallCond 0.000000\n", + "YearBuilt YearBuilt 0.000000\n", + "YearRemodAdd YearRemodAdd 0.000000\n", + "RoofStyle RoofStyle 0.000000\n", + "RoofMatl RoofMatl 0.000000\n", + "Exterior1st Exterior1st 0.000000\n", + "Exterior2nd Exterior2nd 0.000000\n", + "MasVnrType MasVnrType 0.547945\n", + "MasVnrArea MasVnrArea 0.547945\n", + "ExterQual ExterQual 0.000000\n", + "ExterCond ExterCond 0.000000\n", + "Foundation Foundation 0.000000\n", + "BsmtQual BsmtQual 2.534247\n", + "BsmtCond BsmtCond 2.534247\n", + "BsmtExposure BsmtExposure 2.602740\n", + "BsmtFinType1 BsmtFinType1 2.534247\n", + "BsmtFinSF1 BsmtFinSF1 0.000000\n", + "BsmtFinType2 BsmtFinType2 2.602740\n", + "BsmtFinSF2 BsmtFinSF2 
0.000000\n", + "BsmtUnfSF BsmtUnfSF 0.000000\n", + "TotalBsmtSF TotalBsmtSF 0.000000\n", + "Heating Heating 0.000000\n", + "HeatingQC HeatingQC 0.000000\n", + "CentralAir CentralAir 0.000000\n", + "Electrical Electrical 0.068493\n", + "1stFlrSF 1stFlrSF 0.000000\n", + "2ndFlrSF 2ndFlrSF 0.000000\n", + "LowQualFinSF LowQualFinSF 0.000000\n", + "GrLivArea GrLivArea 0.000000\n", + "BsmtFullBath BsmtFullBath 0.000000\n", + "BsmtHalfBath BsmtHalfBath 0.000000\n", + "FullBath FullBath 0.000000\n", + "HalfBath HalfBath 0.000000\n", + "BedroomAbvGr BedroomAbvGr 0.000000\n", + "KitchenAbvGr KitchenAbvGr 0.000000\n", + "KitchenQual KitchenQual 0.000000\n", + "TotRmsAbvGrd TotRmsAbvGrd 0.000000\n", + "Functional Functional 0.000000\n", + "Fireplaces Fireplaces 0.000000\n", + "FireplaceQu FireplaceQu 47.260274\n", + "GarageType GarageType 5.547945\n", + "GarageYrBlt GarageYrBlt 5.547945\n", + "GarageFinish GarageFinish 5.547945\n", + "GarageCars GarageCars 0.000000\n", + "GarageArea GarageArea 0.000000\n", + "GarageQual GarageQual 5.547945\n", + "GarageCond GarageCond 5.547945\n", + "PavedDrive PavedDrive 0.000000\n", + "WoodDeckSF WoodDeckSF 0.000000\n", + "OpenPorchSF OpenPorchSF 0.000000\n", + "EnclosedPorch EnclosedPorch 0.000000\n", + "3SsnPorch 3SsnPorch 0.000000\n", + "ScreenPorch ScreenPorch 0.000000\n", + "PoolArea PoolArea 0.000000\n", + "PoolQC PoolQC 99.520548\n", + "Fence Fence 80.753425\n", + "MiscFeature MiscFeature 96.301370\n", + "MiscVal MiscVal 0.000000\n", + "MoSold MoSold 0.000000\n", + "YrSold YrSold 0.000000\n", + "SaleType SaleType 0.000000\n", + "SaleCondition SaleCondition 0.000000\n", + "SalePrice SalePrice 0.000000\n" + ] + } + ], + "source": [ + "pd.set_option('display.max_rows', None)\n", + "print(get_percent_missing(housing_data))" + ] + }, + { + "cell_type": "markdown", + "id": "c8eb3ee3-085d-4b41-9a5f-c83a3805f870", + "metadata": {}, + "source": [ + "#### Using the SalePrice column for the KNN and MEAN imputation task" + ] + }, + { + "cell_type": 
"markdown", + "id": "451c79fb-17ba-40ac-8f0b-87a8b2ec4837", + "metadata": {}, + "source": [ + "#### Non Scaled dataframe Sale Price - take first 1000 rows" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "id": "9cc1f97f-1b24-4570-8f6a-30426bd79269", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", +
" <thead><tr><th></th><th>SalePrice</th><th>sp_copy_1_percent</th><th>sp_copy_5_percent</th><th>sp_copy_10_percent</th></tr></thead>\n", +
" <tbody>\n", +
" <tr><th>0</th><td>208500</td><td>208500</td><td>208500</td><td>208500</td></tr>\n", +
" <tr><th>1</th><td>181500</td><td>181500</td><td>181500</td><td>181500</td></tr>\n", +
" <tr><th>2</th><td>223500</td><td>223500</td><td>223500</td><td>223500</td></tr>\n", +
" <tr><th>3</th><td>140000</td><td>140000</td><td>140000</td><td>140000</td></tr>\n", +
" <tr><th>4</th><td>250000</td><td>250000</td><td>250000</td><td>250000</td></tr>\n", +
" </tbody>\n", +
"</table>
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "0 208500 208500 208500 208500\n", + "1 181500 181500 181500 181500\n", + "2 223500 223500 223500 223500\n", + "3 140000 140000 140000 140000\n", + "4 250000 250000 250000 250000" + ] + }, + "execution_count": 58, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_saleprice = housing_data[['SalePrice']][:1000]\n", + "df_saleprice['sp_copy_1_percent'] = df_saleprice[['SalePrice']]\n", + "df_saleprice['sp_copy_5_percent'] = df_saleprice[['SalePrice']]\n", + "df_saleprice['sp_copy_10_percent'] = df_saleprice[['SalePrice']]\n", + "df_saleprice.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "id": "f462f065-9f37-44f1-a22e-92e610dae2e9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1000" + ] + }, + "execution_count": 59, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(df_saleprice)" + ] + }, + { + "cell_type": "markdown", + "id": "03407bbd-f8a7-4f6c-a7c3-64a865ed3f7e", + "metadata": {}, + "source": [ + "#### Scaled Dataframe SalePrice - take first 1000 rows" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "e461b1ef-df2c-410f-aea8-abe954fa9afd", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<table border=\"1\" class=\"dataframe\">\n", +
" <thead><tr><th></th><th>SalePrice</th><th>sp_copy_1_percent</th><th>sp_copy_5_percent</th><th>sp_copy_10_percent</th></tr></thead>\n", +
" <tbody>\n", +
" <tr><th>0</th><td>0.241078</td><td>0.241078</td><td>0.241078</td><td>0.241078</td></tr>\n", +
" <tr><th>1</th><td>0.203583</td><td>0.203583</td><td>0.203583</td><td>0.203583</td></tr>\n", +
" <tr><th>2</th><td>0.261908</td><td>0.261908</td><td>0.261908</td><td>0.261908</td></tr>\n", +
" <tr><th>3</th><td>0.145952</td><td>0.145952</td><td>0.145952</td><td>0.145952</td></tr>\n", +
" <tr><th>4</th><td>0.298709</td><td>0.298709</td><td>0.298709</td><td>0.298709</td></tr>\n", +
" </tbody>\n", +
"</table>
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "0 0.241078 0.241078 0.241078 0.241078\n", + "1 0.203583 0.203583 0.203583 0.203583\n", + "2 0.261908 0.261908 0.261908 0.261908\n", + "3 0.145952 0.145952 0.145952 0.145952\n", + "4 0.298709 0.298709 0.298709 0.298709" + ] + }, + "execution_count": 60, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "scaler = MinMaxScaler()\n", + "df_saleprice_scaled = df_saleprice.copy(deep=True)\n", + "df_saleprice_scaled = pd.DataFrame(scaler.fit_transform(df_saleprice_scaled), columns = df_saleprice_scaled.columns)\n", + "df_saleprice_scaled.head()" + ] + }, + { + "cell_type": "markdown", + "id": "a66683c4-f66a-4aa1-ab8a-f28087b60b6c", + "metadata": {}, + "source": [ + "#### Check % missing values in this dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "0075fa0f-4b82-4089-ab81-e5282497c4a3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice))" + ] + }, + { + "cell_type": "markdown", + "id": "619ef99f-55c0-422c-aaa8-73cd71fcf2fb", + "metadata": {}, + "source": [ + "#### Create 1%, 5% and 10% missing data" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "82df5098-4176-4fba-922f-ca84c0466f2a", + "metadata": {}, + "outputs": [], + "source": [ + "create_missing(df_saleprice, 0.01, 'sp_copy_1_percent')\n", + "create_missing(df_saleprice, 0.05, 'sp_copy_5_percent')\n", + "create_missing(df_saleprice, 0.1, 'sp_copy_10_percent')" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "id": "0e90ae04-cd10-4507-a851-c187010f0be0", + "metadata": {}, + "outputs": [], + "source": [ + 
"create_missing(df_saleprice_scaled, 0.01, 'sp_copy_1_percent')\n", + "create_missing(df_saleprice_scaled, 0.05, 'sp_copy_5_percent')\n", + "create_missing(df_saleprice_scaled, 0.1, 'sp_copy_10_percent')" + ] + }, + { + "cell_type": "markdown", + "id": "a8237a82-5a33-4ce9-b4c7-a48ede4f5fef", + "metadata": {}, + "source": [ + "#### With/Without scaling dataframe missing values check" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "id": "2794306d-89c7-4518-8979-9edb3d9441b1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 1.0\n", + "sp_copy_5_percent sp_copy_5_percent 5.0\n", + "sp_copy_10_percent sp_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice))" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "id": "8351dbe2-b388-451d-9238-52c4ccabd425", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 1.0\n", + "sp_copy_5_percent sp_copy_5_percent 5.0\n", + "sp_copy_10_percent sp_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice_scaled))" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "id": "b11b093f-110b-4ef3-9d00-ac4fed45a956", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "10" + ] + }, + "execution_count": 66, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_saleprice['sp_copy_1_percent'].isna().sum()" + ] + }, + { + "cell_type": "markdown", + "id": "360e0010-e085-435c-8902-80c6a7ea78be", + "metadata": {}, + "source": [ + "#### Store indices of missing values" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "id": "e546096c-ce35-448e-aa97-0943d3535a87", + "metadata": {}, + 
"outputs": [], + "source": [ + "# Store Index of NaN values in each coloumns\n", + "sp_1_idx = list(np.where(df_saleprice['sp_copy_1_percent'].isna())[0])\n", + "sp_5_idx = list(np.where(df_saleprice['sp_copy_5_percent'].isna())[0])\n", + "sp_10_idx = list(np.where(df_saleprice['sp_copy_10_percent'].isna())[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "id": "d409e2a5-b3a9-4ae1-9b17-88b7c642692d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_1_idx))\n", + "print(len(sp_5_idx))\n", + "print(len(sp_10_idx))" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "5839460a-e736-42e9-9a13-d5bab5683115", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Length of sp_1_idx is 10 and it contains 1.0% of total data in column | Total rows: 1000\n", + "Length of sp_5_idx is 50 and it contains 5.0% of total data in column | Total rows: 1000\n", + "Length of sp_10_idx is 100 and it contains 10.0% of total data in column | Total rows: 1000\n" + ] + } + ], + "source": [ + "print(f\"Length of sp_1_idx is {len(sp_1_idx)} and it contains {(len(sp_1_idx)/len(df_saleprice['sp_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_1_percent'])}\")\n", + "print(f\"Length of sp_5_idx is {len(sp_5_idx)} and it contains {(len(sp_5_idx)/len(df_saleprice['sp_copy_5_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_1_percent'])}\")\n", + "print(f\"Length of sp_10_idx is {len(sp_10_idx)} and it contains {(len(sp_10_idx)/len(df_saleprice['sp_copy_10_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_1_percent'])}\")" + ] + }, + { + "cell_type": "markdown", + "id": "c1464c79-c0a9-4640-92dd-f0d5131634ab", + "metadata": {}, + "source": [ + "### Perform KNN to df_saleprice and 
df_saleprice_scaled dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "id": "08fa2436-ffb8-4b5d-a7a1-9e2d63b14562", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice1 = df_saleprice.copy(deep=True)\n", + "imputer = KNNImputer(n_neighbors=5)\n", + "imputed_saleprice_df = pd.DataFrame(imputer.fit_transform(df_saleprice1), columns = df_saleprice1.columns)" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "id": "205c7a96-3f1c-42a4-91de-f22f15ce9cb2", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice_scaled1 = df_saleprice_scaled.copy(deep=True)\n", + "imputer = KNNImputer(n_neighbors=5)\n", + "imputed_saleprice_scaled_df = pd.DataFrame(imputer.fit_transform(df_saleprice_scaled1), columns = df_saleprice_scaled1.columns)" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "id": "a482f58d-73b6-423c-b97a-140884830a0f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SalePricesp_copy_1_percentsp_copy_5_percentsp_copy_10_percent
0208500.0208500.0208500.0208500.0
1181500.0181500.0181500.0181500.0
2223500.0223500.0223500.0223500.0
3140000.0140000.0140000.0140000.0
4250000.0250000.0250000.0250000.0
\n", + "
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "0 208500.0 208500.0 208500.0 208500.0\n", + "1 181500.0 181500.0 181500.0 181500.0\n", + "2 223500.0 223500.0 223500.0 223500.0\n", + "3 140000.0 140000.0 140000.0 140000.0\n", + "4 250000.0 250000.0 250000.0 250000.0" + ] + }, + "execution_count": 72, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "imputed_saleprice_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "id": "11f8f5ff-f06d-4ec2-a4e3-1324e807a537", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SalePricesp_copy_1_percentsp_copy_5_percentsp_copy_10_percent
00.2410780.2410780.2410780.241078
10.2035830.2035830.2035830.203583
20.2619080.2619080.2619080.261908
30.1459520.1459520.1459520.145952
40.2987090.2987090.2987090.298709
\n", + "
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "0 0.241078 0.241078 0.241078 0.241078\n", + "1 0.203583 0.203583 0.203583 0.203583\n", + "2 0.261908 0.261908 0.261908 0.261908\n", + "3 0.145952 0.145952 0.145952 0.145952\n", + "4 0.298709 0.298709 0.298709 0.298709" + ] + }, + "execution_count": 73, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "imputed_saleprice_scaled_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "d9fd7fa1-4ce0-43be-9955-55ef759d930b", + "metadata": {}, + "source": [ + "#### Check % missing in saleprice and saleprice_scaled DF" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "id": "9ed0d36a-9584-4e3b-9201-2ac36827bce9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(imputed_saleprice_df))" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "id": "7c842fce-bbd5-4c2c-bb1a-db5df92f6315", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(imputed_saleprice_scaled_df))" + ] + }, + { + "cell_type": "markdown", + "id": "ac47abb1-df5f-4686-bc67-6617140c008c", + "metadata": {}, + "source": [ + "#### Store the list of differences between Org. 
and Imputed Value" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "id": "99e04554-568d-4efa-a110-768b50dfaee6", + "metadata": {}, + "outputs": [], + "source": [ + "# create lists of absolute differences between imputed and original values\n", + "\n", + "sp_diff_1 = []\n", + "sp_diff_5 = []\n", + "sp_diff_10 = []\n", + "\n", + "for i in sp_1_idx:\n", + " diff1 = abs(imputed_saleprice_df['sp_copy_1_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", + " sp_diff_1.append(diff1)\n", + " \n", + "\n", + "for i in sp_5_idx:\n", + " diff5 = abs(imputed_saleprice_df['sp_copy_5_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", + " sp_diff_5.append(diff5)\n", + "\n", + "for i in sp_10_idx:\n", + " diff10 = abs(imputed_saleprice_df['sp_copy_10_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", + " sp_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "id": "92204f8a-497c-470d-a770-59165d226cc9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_diff_1))\n", + "print(len(sp_diff_5))\n", + "print(len(sp_diff_10))" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "id": "b8875fff-0289-4dd9-92c1-78dc9b730d22", + "metadata": {}, + "outputs": [], + "source": [ + "# create lists of absolute differences between imputed and original values\n", + "# Assumption: create_missing() draws positions independently on each call, so the\n", + "# scaled dataframe has its own missing rows; look them up there instead of reusing sp_*_idx\n", + "sp_scaled_1_idx = np.where(df_saleprice_scaled['sp_copy_1_percent'].isna())[0]\n", + "sp_scaled_5_idx = np.where(df_saleprice_scaled['sp_copy_5_percent'].isna())[0]\n", + "sp_scaled_10_idx = np.where(df_saleprice_scaled['sp_copy_10_percent'].isna())[0]\n", + "\n", + "sp_scaled_diff_1 = []\n", + "sp_scaled_diff_5 = []\n", + "sp_scaled_diff_10 = []\n", + "\n", 
+ "for i in sp_scaled_1_idx:\n", + " diff1 = abs(imputed_saleprice_scaled_df['sp_copy_1_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", + " sp_scaled_diff_1.append(diff1)\n", + " \n", + "\n", + "for i in sp_scaled_5_idx:\n", + " diff5 = abs(imputed_saleprice_scaled_df['sp_copy_5_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", + " sp_scaled_diff_5.append(diff5)\n", + "\n", + "for i in sp_scaled_10_idx:\n", + " diff10 = abs(imputed_saleprice_scaled_df['sp_copy_10_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", + " sp_scaled_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "id": "40192344-79a4-444c-a12a-2201dc5aa0c1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_scaled_diff_1))\n", + "print(len(sp_scaled_diff_5))\n", + "print(len(sp_scaled_diff_10))" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "id": "a95bd45c-8a2f-4159-8306-399ec18a4c0f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[0.0, 0.0, 0.0, 0.0, 0.0]" + ] + }, + "execution_count": 80, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sp_scaled_diff_1[:5]" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "id": "0f73d420-8842-4062-ae17-158a0a25e169", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[0.0, 100.0, 20.0, 0.0, 780.0]" + ] + }, + "execution_count": 81, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sp_diff_1[:5]" + ] + }, + { + "cell_type": "markdown", + "id": "a40fd400-913b-4011-b0b9-dd3ca0d5827a", + "metadata": {}, + "source": [ + "#### Calculate the mean and var of list of diff. 
KNN - SalePrice" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "id": "80267827-7f73-49ff-b200-27cdb2963756", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 105.0 and varience 1% is 52105.0\n", + "The mean of 5% is 163.0120000000001 and varience 5% is 46018.96385599976\n", + "The mean of 10% is 163.0120000000001 and varience 10% is 3667553.3671999993\n" + ] + } + ], + "source": [ + "m1 = sum(sp_diff_1) / len(sp_diff_1)\n", + "\n", + "# population variance (divides by n) via a generator expression\n", + "var_res1 = sum((xi - m1) ** 2 for xi in sp_diff_1) / len(sp_diff_1)\n", + "\n", + "m5 = sum(sp_diff_5) / len(sp_diff_5)\n", + "\n", + "# population variance (divides by n) via a generator expression\n", + "var_res5 = sum((xii - m5) ** 2 for xii in sp_diff_5) / len(sp_diff_5)\n", + "\n", + "\n", + "m10 = sum(sp_diff_10) / len(sp_diff_10)\n", + "\n", + "# population variance (divides by n) via a generator expression\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_diff_10) / len(sp_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "id": "358545ff-2fcf-4c99-9049-4eaf6dd110bd", + "metadata": {}, + "outputs": [], + "source": [ + "df_knn_saleprice = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", + " '5%_saleprice': [m5, var_res5],\n", + " '10%_saleprice': [m10, var_res10]}, orient='index')\n", + "df_knn_saleprice.columns=['diff. list Mean(KNN)', 'diff. list Var.(KNN)']" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "id": "3714c8f9-58db-40a7-b5a2-6bb7e788b734", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
diff. list Mean(KNN)diff. list Var.(KNN)
1%_saleprice105.0005.210500e+04
5%_saleprice163.0124.601896e+04
10%_saleprice470.8003.667553e+06
\n", + "
" + ], + "text/plain": [ + " diff. list Mean(KNN) diff. list Var.(KNN)\n", + "1%_saleprice 105.000 5.210500e+04\n", + "5%_saleprice 163.012 4.601896e+04\n", + "10%_saleprice 470.800 3.667553e+06" + ] + }, + "execution_count": 84, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_knn_saleprice" + ] + }, + { + "cell_type": "markdown", + "id": "fd7608a8-c5fb-425c-a340-af01801ee349", + "metadata": {}, + "source": [ + "#### Calculate the mean and var of list of diff. KNN - SalePrice scaled" + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "id": "bb03017f-3d91-48d9-8ebf-7cb5c25fadc3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 0.0 and varience 1% is 0.0\n", + "The mean of 5% is 1.2498264129982007e-05 and varience 5% is 7.654123706876951e-09\n", + "The mean of 10% is 1.2498264129982007e-05 and varience 10% is 2.9738417673284677e-06\n" + ] + } + ], + "source": [ + "m1 = sum(sp_scaled_diff_1) / len(sp_scaled_diff_1)\n", + "\n", + "# population variance (divides by n) via a generator expression\n", + "var_res1 = sum((xi - m1) ** 2 for xi in sp_scaled_diff_1) / len(sp_scaled_diff_1)\n", + "\n", + "m5 = sum(sp_scaled_diff_5) / len(sp_scaled_diff_5)\n", + "\n", + "# population variance (divides by n) via a generator expression\n", + "var_res5 = sum((xii - m5) ** 2 for xii in sp_scaled_diff_5) / len(sp_scaled_diff_5)\n", + "\n", + "\n", + "m10 = sum(sp_scaled_diff_10) / len(sp_scaled_diff_10)\n", + "\n", + "# population variance (divides by n) via a generator expression\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_scaled_diff_10) / len(sp_scaled_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "id": 
"290d8db2-c9f4-4028-ab44-ad68c9e7b3c5", + "metadata": 
{}, + "outputs": [], + "source": [ + "df_knn_saleprice_scaled = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", + " '5%_saleprice': [m5, var_res5],\n", + " '10%_saleprice': [m10, var_res10]}, orient='index')\n", + "df_knn_saleprice_scaled.columns=['diff. list Mean(KNN) scaled', 'diff. list Var.(KNN) scaled']" + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "id": "89347fd7-d87d-42bb-b375-a75417c395de", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
diff. list Mean(KNN) scaleddiff. list Var.(KNN) scaled
1%_saleprice0.0000000.000000e+00
5%_saleprice0.0000127.654124e-09
10%_saleprice0.0002652.973842e-06
\n", + "
" + ], + "text/plain": [ + " diff. list Mean(KNN) scaled diff. list Var.(KNN) scaled\n", + "1%_saleprice 0.000000 0.000000e+00\n", + "5%_saleprice 0.000012 7.654124e-09\n", + "10%_saleprice 0.000265 2.973842e-06" + ] + }, + "execution_count": 87, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_knn_saleprice_scaled" + ] + }, + { + "cell_type": "markdown", + "id": "c984dc69-f85f-4f1b-8c94-4afb48c1c8db", + "metadata": {}, + "source": [ + "### Perform MEAN imputation" + ] + }, + { + "cell_type": "code", + "execution_count": 88, + "id": "008bc14f-45e7-42d8-b843-2fee7bcf26c2", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice2 = df_saleprice.copy(deep=True)\n", + "df_saleprice_scaled2 = df_saleprice_scaled.copy(deep=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "id": "bd71dc1a-f137-46ed-bf2b-f3d87fd4b6a0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 1.0\n", + "sp_copy_5_percent sp_copy_5_percent 5.0\n", + "sp_copy_10_percent sp_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice2))" + ] + }, + { + "cell_type": "code", + "execution_count": 90, + "id": "46237cfd-6361-466f-b66f-32f5940149d6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 1.0\n", + "sp_copy_5_percent sp_copy_5_percent 5.0\n", + "sp_copy_10_percent sp_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice_scaled2))" + ] + }, + { + "cell_type": "markdown", + "id": "64465299-5620-47b9-a28d-afb5494f279e", + "metadata": {}, + "source": [ + "#### Impute Mean values in missing for saleprice and saleprice_scaled" + ] + }, + { + "cell_type": "code", + 
"execution_count": 91, + "id": "28cf6b75-eebf-4758-94ec-4b3536f2c659", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice2['sp_copy_1_percent'] = df_saleprice2['sp_copy_1_percent'].fillna(df_saleprice2['sp_copy_1_percent'].mean())\n", + "df_saleprice2['sp_copy_5_percent'] = df_saleprice2['sp_copy_5_percent'].fillna(df_saleprice2['sp_copy_5_percent'].mean())\n", + "df_saleprice2['sp_copy_10_percent'] = df_saleprice2['sp_copy_10_percent'].fillna(df_saleprice2['sp_copy_10_percent'].mean())" + ] + }, + { + "cell_type": "code", + "execution_count": 92, + "id": "2409dd8c-3cd0-4742-b0ac-14dea1fdb504", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice_scaled2['sp_copy_1_percent'] = df_saleprice_scaled2['sp_copy_1_percent'].fillna(df_saleprice_scaled2['sp_copy_1_percent'].mean())\n", + "df_saleprice_scaled2['sp_copy_5_percent'] = df_saleprice_scaled2['sp_copy_5_percent'].fillna(df_saleprice_scaled2['sp_copy_5_percent'].mean())\n", + "df_saleprice_scaled2['sp_copy_10_percent'] = df_saleprice_scaled2['sp_copy_10_percent'].fillna(df_saleprice_scaled2['sp_copy_10_percent'].mean())" + ] + }, + { + "cell_type": "markdown", + "id": "62377754-b682-45e5-8faa-1a4a186bd3c7", + "metadata": {}, + "source": [ + "#### After MEAN imputation - Saleprice and saleprice scaled" + ] + }, + { + "cell_type": "code", + "execution_count": 93, + "id": "6c448556-55f4-4685-aed2-6b67d5ad8a2a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice2))" + ] + }, + { + "cell_type": "code", + "execution_count": 94, + "id": "d9775fbf-7a72-4352-b446-488e9d25b6a2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 
column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice_scaled2))" + ] + }, + { + "cell_type": "code", + "execution_count": 95, + "id": "136f87e6-a4af-4229-b36a-695f712deee5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SalePricesp_copy_1_percentsp_copy_5_percentsp_copy_10_percent
436116000116000.0116000.0116000.000000
21139400139400.0139400.0139400.000000
618314813314813.0314813.0314813.000000
207141000141000.0141000.0182369.783333
366159000159000.0159000.0159000.000000
\n", + "
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "436 116000 116000.0 116000.0 116000.000000\n", + "21 139400 139400.0 139400.0 139400.000000\n", + "618 314813 314813.0 314813.0 314813.000000\n", + "207 141000 141000.0 141000.0 182369.783333\n", + "366 159000 159000.0 159000.0 159000.000000" + ] + }, + "execution_count": 95, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_saleprice2.sample(5)" + ] + }, + { + "cell_type": "code", + "execution_count": 96, + "id": "784cb61c-78f8-4b31-b709-379c50024dca", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
SalePricesp_copy_1_percentsp_copy_5_percentsp_copy_10_percent
4570.3070410.3070410.3070410.201890
8760.1351900.1351900.1351900.135190
3610.1528950.1528950.1528950.152895
6820.1917790.1917790.1917790.201890
5230.2080960.2080960.2080960.208096
\n", + "
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "457 0.307041 0.307041 0.307041 0.201890\n", + "876 0.135190 0.135190 0.135190 0.135190\n", + "361 0.152895 0.152895 0.152895 0.152895\n", + "682 0.191779 0.191779 0.191779 0.201890\n", + "523 0.208096 0.208096 0.208096 0.208096" + ] + }, + "execution_count": 96, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_saleprice_scaled2.sample(5)" + ] + }, + { + "cell_type": "markdown", + "id": "33c1f3b7-5afc-45cb-8b43-9682ec87156d", + "metadata": {}, + "source": [ + "#### Create List of differences for saleprice and saleprice_scaled Dataframes" + ] + }, + { + "cell_type": "code", + "execution_count": 97, + "id": "d2faf410-f83e-4ccb-89d4-e6f8c7adffbb", + "metadata": {}, + "outputs": [], + "source": [ + "# create lists of absolute differences between imputed and original values\n", + "\n", + "sp_mean_diff_1 = []\n", + "sp_mean_diff_5 = []\n", + "sp_mean_diff_10 = []\n", + "\n", + "for i in sp_1_idx:\n", + " diff1 = abs(df_saleprice2['sp_copy_1_percent'][i] - df_saleprice2['SalePrice'][i])\n", + " sp_mean_diff_1.append(diff1)\n", + " \n", + "\n", + "for i in sp_5_idx:\n", + " diff5 = abs(df_saleprice2['sp_copy_5_percent'][i] - df_saleprice2['SalePrice'][i])\n", + " sp_mean_diff_5.append(diff5)\n", + "\n", + "for i in sp_10_idx:\n", + " diff10 = abs(df_saleprice2['sp_copy_10_percent'][i] - df_saleprice2['SalePrice'][i])\n", + " sp_mean_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 98, + "id": "789b07c5-530a-4111-8c97-f5297f7da5e4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_mean_diff_1))\n", + "print(len(sp_mean_diff_5))\n", + "print(len(sp_mean_diff_10))" + ] + }, + { + "cell_type": "code", + "execution_count": 99, + "id": "4fec222c-2420-41af-9e2a-d9773e1d6259", + 
"metadata": {}, + "outputs": [], + "source": [ + "# create lists of absolute differences between imputed and original values\n", + "# Assumption: create_missing() draws positions independently on each call, so the\n", + "# scaled dataframe has its own missing rows; look them up there instead of reusing sp_*_idx\n", + "sp_scaled_1_idx = np.where(df_saleprice_scaled['sp_copy_1_percent'].isna())[0]\n", + "sp_scaled_5_idx = np.where(df_saleprice_scaled['sp_copy_5_percent'].isna())[0]\n", + "sp_scaled_10_idx = np.where(df_saleprice_scaled['sp_copy_10_percent'].isna())[0]\n", + "\n", + "sp_scaled_mean_diff_1 = []\n", + "sp_scaled_mean_diff_5 = []\n", + "sp_scaled_mean_diff_10 = []\n", + "\n", + "for i in sp_scaled_1_idx:\n", + " diff1 = abs(df_saleprice_scaled2['sp_copy_1_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", + " sp_scaled_mean_diff_1.append(diff1)\n", + " \n", + "\n", + "for i in sp_scaled_5_idx:\n", + " diff5 = abs(df_saleprice_scaled2['sp_copy_5_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", + " sp_scaled_mean_diff_5.append(diff5)\n", + "\n", + "for i in sp_scaled_10_idx:\n", + " diff10 = abs(df_saleprice_scaled2['sp_copy_10_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", + " sp_scaled_mean_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 100, + "id": "de9bf1de-68fe-4894-915a-7069b386123f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_scaled_mean_diff_1))\n", + "print(len(sp_scaled_mean_diff_5))\n", + "print(len(sp_scaled_mean_diff_10))" + ] + }, + { + "cell_type": "markdown", + "id": "f7b93757-d1a7-41a1-85fa-3ee77734be5b", + "metadata": {}, + "source": [ + "#### Calculate mean and var of list of diff. 
- MEAN impute SalePrice" + ] + }, + { + "cell_type": "code", + "execution_count": 101, + "id": "c60d3aad-33f0-48f4-8bb0-f8af45e33e1e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 47198.61696969698 and varience 1% is 634546571.3543438\n", + "The mean of 5% is 54438.20686315788 and varience 5% is 1768876209.3358026\n", + "The mean of 10% is 54438.20686315788 and varience 10% is 2875290913.3009353\n" + ] + } + ], + "source": [ + "m1 = sum(sp_mean_diff_1) / len(sp_mean_diff_1)\n", + "\n", + "# population variance (divides by n) via a generator expression\n", + "var_res1 = sum((xi - m1) ** 2 for xi in sp_mean_diff_1) / len(sp_mean_diff_1)\n", + "\n", + "m5 = sum(sp_mean_diff_5) / len(sp_mean_diff_5)\n", + "\n", + "# population variance (divides by n) via a generator expression\n", + "var_res5 = sum((xii - m5) ** 2 for xii in sp_mean_diff_5) / len(sp_mean_diff_5)\n", + "\n", + "\n", + "m10 = sum(sp_mean_diff_10) / len(sp_mean_diff_10)\n", + "\n", + "# population variance (divides by n) via a generator expression\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_mean_diff_10) / len(sp_mean_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 102, + "id": "e7f6e5cf-4eaa-4bfe-add2-fc7f600941b7", + "metadata": {}, + "outputs": [], + "source": [ + "df_mean_saleprice = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", + " '5%_saleprice': [m5, var_res5],\n", + " '10%_saleprice': [m10, var_res10]}, orient='index')\n", + "df_mean_saleprice.columns=['diff. list Mean(MI)', 'diff. 
list Var.(MI)']" + ] + }, + { + "cell_type": "code", + "execution_count": 103, + "id": "cc37eeaf-e3cd-4a83-870d-fab7037eeffe", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
diff. list Mean(MI)diff. list Var.(MI)
1%_saleprice47198.6169706.345466e+08
5%_saleprice54438.2068631.768876e+09
10%_saleprice58045.6366672.875291e+09
\n", + "
" + ], + "text/plain": [ + " diff. list Mean(MI) diff. list Var.(MI)\n", + "1%_saleprice 47198.616970 6.345466e+08\n", + "5%_saleprice 54438.206863 1.768876e+09\n", + "10%_saleprice 58045.636667 2.875291e+09" + ] + }, + "execution_count": 103, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_mean_saleprice" + ] + }, + { + "cell_type": "markdown", + "id": "f405f073-1b45-47e8-873b-7a9d34ad0e5c", + "metadata": {}, + "source": [ + "#### Calculate mean and var of list of diff. - MEAN impute SalePrice scaled" + ] + }, + { + "cell_type": "code", + "execution_count": 104, + "id": "2516b4f7-6b79-4636-9bd5-0738343ea355", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 0.0 and varience 1% is 0.0\n", + "The mean of 5% is 0.0016175777048509216 and varience 5% is 5.557201947380946e-05\n", + "The mean of 10% is 0.0016175777048509216 and varience 10% is 0.004250732648521598\n" + ] + } + ], + "source": [ + "m1 = sum(sp_scaled_mean_diff_1) / len(sp_scaled_mean_diff_1)\n", + "\n", + "# population variance (divides by n) via a generator expression\n", + "var_res1 = sum((xi - m1) ** 2 for xi in sp_scaled_mean_diff_1) / len(sp_scaled_mean_diff_1)\n", + "\n", + "m5 = sum(sp_scaled_mean_diff_5) / len(sp_scaled_mean_diff_5)\n", + "\n", + "# population variance (divides by n) via a generator expression\n", + "var_res5 = sum((xii - m5) ** 2 for xii in sp_scaled_mean_diff_5) / len(sp_scaled_mean_diff_5)\n", + "\n", + "\n", + "m10 = sum(sp_scaled_mean_diff_10) / len(sp_scaled_mean_diff_10)\n", + "\n", + "# population variance (divides by n) via a generator expression\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_scaled_mean_diff_10) / len(sp_scaled_mean_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] 
+ }, + { + "cell_type": "code", + 
"execution_count": 105, + "id": "fe6a93b8-d6cb-4d7d-856b-ab4ee8fe78fc", + "metadata": {}, + "outputs": [], + "source": [ + "df_mean_saleprice_scaled = pd.DataFrame.from_dict({'1%_saleprice_scaled': [m1, var_res1],\n", + " '5%_saleprice_scaled': [m5, var_res5],\n", + " '10%_saleprice_scaled': [m10, var_res10]}, orient='index')\n", + "df_mean_saleprice_scaled.columns=['diff. list Mean(MI) scaled', 'diff. list Var.(MI) scaled']" + ] + }, + { + "cell_type": "code", + "execution_count": 106, + "id": "e74c35ed-7c2d-44ab-b6c2-4d81c2c6b6bb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
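<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>diff. list Mean(MI) scaled</th>\n",
+ "      <th>diff. list Var.(MI) scaled</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>1%_saleprice_scaled</th>\n",
+ "      <td>0.000000</td>\n",
+ "      <td>0.000000</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>5%_saleprice_scaled</th>\n",
+ "      <td>0.001618</td>\n",
+ "      <td>0.000056</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>10%_saleprice_scaled</th>\n",
+ "      <td>0.018922</td>\n",
+ "      <td>0.004251</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>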
" + ], + "text/plain": [ + " diff. list Mean(MI) scaled diff. list Var.(MI) scaled\n", + "1%_saleprice_scaled 0.000000 0.000000\n", + "5%_saleprice_scaled 0.001618 0.000056\n", + "10%_saleprice_scaled 0.018922 0.004251" + ] + }, + "execution_count": 106, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_mean_saleprice_scaled" + ] + }, + { + "cell_type": "markdown", + "id": "876b979a-f5c4-43a7-9ead-d5d866bef078", + "metadata": {}, + "source": [ + "# 2.2 Housing Data Results - KNN and MEAN" + ] + }, + { + "cell_type": "code", + "execution_count": 107, + "id": "e90e9486-280d-4e96-b16a-0c3314eaedc9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
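<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>diff. list Mean(KNN)</th>\n",
+ "      <th>diff. list Var.(KNN)</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>1%_saleprice</th>\n",
+ "      <td>105.000</td>\n",
+ "      <td>5.210500e+04</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>5%_saleprice</th>\n",
+ "      <td>163.012</td>\n",
+ "      <td>4.601896e+04</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>10%_saleprice</th>\n",
+ "      <td>470.800</td>\n",
+ "      <td>3.667553e+06</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div><div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>diff. list Mean(KNN) scaled</th>\n",
+ "      <th>diff. list Var.(KNN) scaled</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>1%_saleprice</th>\n",
+ "      <td>0.000000</td>\n",
+ "      <td>0.000000e+00</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>5%_saleprice</th>\n",
+ "      <td>0.000012</td>\n",
+ "      <td>7.654124e-09</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>10%_saleprice</th>\n",
+ "      <td>0.000265</td>\n",
+ "      <td>2.973842e-06</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div><div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>diff. list Mean(MI)</th>\n",
+ "      <th>diff. list Var.(MI)</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>1%_saleprice</th>\n",
+ "      <td>47198.616970</td>\n",
+ "      <td>6.345466e+08</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>5%_saleprice</th>\n",
+ "      <td>54438.206863</td>\n",
+ "      <td>1.768876e+09</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>10%_saleprice</th>\n",
+ "      <td>58045.636667</td>\n",
+ "      <td>2.875291e+09</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div><div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>diff. list Mean(MI) scaled</th>\n",
+ "      <th>diff. list Var.(MI) scaled</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>1%_saleprice_scaled</th>\n",
+ "      <td>0.000000</td>\n",
+ "      <td>0.000000</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>5%_saleprice_scaled</th>\n",
+ "      <td>0.001618</td>\n",
+ "      <td>0.000056</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>10%_saleprice_scaled</th>\n",
+ "      <td>0.018922</td>\n",
+ "      <td>0.004251</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>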
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 107, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "multi_table([df_knn_saleprice, df_knn_saleprice_scaled, df_mean_saleprice, df_mean_saleprice_scaled])" + ] + }, + { + "cell_type": "markdown", + "id": "e07a4e01-7e4e-4bdb-b6c7-ef2424fc6a80", + "metadata": {}, + "source": [ + "Result: Another takeaway here is that if we use scaling before performing the imputation, the imputation works much better and accuratly. Although the mean imputation provided less accurate results as compared to the KNN imputation, but the accuracy of the imputed values are still better if we use scaling than not using it. KNN imputation on other hand did perform better than mean imputation, however the results are much better if we use scaled dataset." + ] + }, + { + "cell_type": "markdown", + "id": "977be574-18b2-4f80-a019-2a86227a14d6", + "metadata": {}, + "source": [ + "# Conclusion\n", + "1. KNN imputation is performing better than mean imputation\n", + "2. If we use scaled dataset as compared to non scaled dataset, the results are even better (almost close to perfect!)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "764c9bdb-78dc-4287-a527-0e14ff58a5e9", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}