diff --git a/notebooks/Imputation_best_practices/Imputation_best_practices.ipynb b/notebooks/Imputation_best_practices/Imputation_best_practices.ipynb new file mode 100644 index 0000000..0ecc6f1 --- /dev/null +++ b/notebooks/Imputation_best_practices/Imputation_best_practices.ipynb @@ -0,0 +1,4475 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "fce74c70-b998-437d-bd77-43d723d57f13", + "metadata": {}, + "source": [ + "# Handling Missing Data\n", + "One of the first steps in any data science workflow is to understand the dataset and clean it. Real-world datasets are often messy and need significant preprocessing before they can be used for downstream tasks such as feature engineering and model training. One part of data cleaning is handling missing data. Common approaches include dropping the affected rows, filling with zeros, filling with the column mean, and KNN imputation. In this notebook, we explore two of these techniques, mean imputation and KNN imputation, and compare their effectiveness on two sample datasets.\n", + "\n", + "a. The first sample dataset is a column of 1000 randomly generated numbers, on which we perform basic KNN and mean imputation.\n", + "\n", + "b. The second sample dataset is the UCI housing dataset, on which we apply mean and KNN imputation both with and without scaling." + ] + }, + { + "cell_type": "markdown", + "id": "e2ceaeb0-e282-4c63-97e2-f1dd03810aa2", + "metadata": {}, + "source": [ + "# What to try in this notebook?\n", + "\n", + "#### 1. Generate a dataset of random numbers (here with `np.random.rand`), take one column and create missing values (1%, 5%, 10%), scale the values, and apply KNN and mean imputation. Compare the results and compute mean() and var() of the list of differences between the original and imputed values.\n", + "\n", + "\n", + "#### 2. 
Take a housing dataset from UCI, pick one column and create missing values (1%, 5%, 10%), scale the values, and apply KNN and mean imputation. Compare the results and compute mean() and var() of the list of differences between the original and imputed values.\n", + "\n", + "Dataset - https://raw.githubusercontent.com/SheshNGupta/datasets/main/train.csv" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "d8fe4103-6e71-4b97-810c-b599a0482944", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "from sklearn.impute import KNNImputer\n", + "from sklearn.preprocessing import MinMaxScaler" + ] + }, + { + "cell_type": "markdown", + "id": "f95427ef-d6bc-47b8-a516-45a05b238180", + "metadata": {}, + "source": [ + "# 1.1 Random Numbers dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "ae373dd4-26c0-46e8-bdba-dd1d31c77e4e", + "metadata": {}, + "outputs": [], + "source": [ + "random_dataset = pd.DataFrame({'number': np.random.rand(1000)})" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "5ea97930-03cd-48ff-97b9-97e9cd9dde55", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>number</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>823</th><td>0.925249</td></tr>\n",
+ "    <tr><th>266</th><td>0.077479</td></tr>\n",
+ "    <tr><th>959</th><td>0.897447</td></tr>\n",
+ "    <tr><th>493</th><td>0.259423</td></tr>\n",
+ "    <tr><th>768</th><td>0.193178</td></tr>\n",
+ "    <tr><th>105</th><td>0.174632</td></tr>\n",
+ "    <tr><th>610</th><td>0.456349</td></tr>\n",
+ "    <tr><th>824</th><td>0.688290</td></tr>\n",
+ "    <tr><th>968</th><td>0.493667</td></tr>\n",
+ "    <tr><th>849</th><td>0.368834</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " number\n", + "823 0.925249\n", + "266 0.077479\n", + "959 0.897447\n", + "493 0.259423\n", + "768 0.193178\n", + "105 0.174632\n", + "610 0.456349\n", + "824 0.688290\n", + "968 0.493667\n", + "849 0.368834" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "random_dataset.sample(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "f19e199b-91aa-4e03-9e07-37f5a574d481", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 1000 entries, 0 to 999\n", + "Data columns (total 1 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 number 1000 non-null float64\n", + "dtypes: float64(1)\n", + "memory usage: 7.9 KB\n" + ] + } + ], + "source": [ + "random_dataset.info()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "382f0f03-b3f4-4244-a95c-e78476fae2ca", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "count 1000.000000\n", + "mean 0.494461\n", + "std 0.286876\n", + "min 0.001560\n", + "25% 0.252068\n", + "50% 0.489302\n", + "75% 0.733584\n", + "max 0.999815\n", + "Name: number, dtype: float64" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "random_dataset['number'].describe()" + ] + }, + { + "cell_type": "markdown", + "id": "348a0b85-c450-4d5d-a9d2-c57c95964b42", + "metadata": {}, + "source": [ + "#### Create 3 col. for numbers for 1%, 5% and 10% missing data" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "f5de26b3-17b7-463b-98e4-147a457ca37e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>number</th><th>number_copy_1_percent</th><th>number_copy_5_percent</th><th>number_copy_10_percent</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>0.438564</td><td>0.438564</td><td>0.438564</td><td>0.438564</td></tr>\n",
+ "    <tr><th>1</th><td>0.836801</td><td>0.836801</td><td>0.836801</td><td>0.836801</td></tr>\n",
+ "    <tr><th>2</th><td>0.798077</td><td>0.798077</td><td>0.798077</td><td>0.798077</td></tr>\n",
+ "    <tr><th>3</th><td>0.269161</td><td>0.269161</td><td>0.269161</td><td>0.269161</td></tr>\n",
+ "    <tr><th>4</th><td>0.830948</td><td>0.830948</td><td>0.830948</td><td>0.830948</td></tr>\n",
+ "    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
+ "    <tr><th>995</th><td>0.920130</td><td>0.920130</td><td>0.920130</td><td>0.920130</td></tr>\n",
+ "    <tr><th>996</th><td>0.007397</td><td>0.007397</td><td>0.007397</td><td>0.007397</td></tr>\n",
+ "    <tr><th>997</th><td>0.163360</td><td>0.163360</td><td>0.163360</td><td>0.163360</td></tr>\n",
+ "    <tr><th>998</th><td>0.553700</td><td>0.553700</td><td>0.553700</td><td>0.553700</td></tr>\n",
+ "    <tr><th>999</th><td>0.771442</td><td>0.771442</td><td>0.771442</td><td>0.771442</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "<p>1000 rows × 4 columns</p>\n",
+ "</div>" + ], + "text/plain": [ + " number number_copy_1_percent number_copy_5_percent \\\n", + "0 0.438564 0.438564 0.438564 \n", + "1 0.836801 0.836801 0.836801 \n", + "2 0.798077 0.798077 0.798077 \n", + "3 0.269161 0.269161 0.269161 \n", + "4 0.830948 0.830948 0.830948 \n", + ".. ... ... ... \n", + "995 0.920130 0.920130 0.920130 \n", + "996 0.007397 0.007397 0.007397 \n", + "997 0.163360 0.163360 0.163360 \n", + "998 0.553700 0.553700 0.553700 \n", + "999 0.771442 0.771442 0.771442 \n", + "\n", + " number_copy_10_percent \n", + "0 0.438564 \n", + "1 0.836801 \n", + "2 0.798077 \n", + "3 0.269161 \n", + "4 0.830948 \n", + ".. ... \n", + "995 0.920130 \n", + "996 0.007397 \n", + "997 0.163360 \n", + "998 0.553700 \n", + "999 0.771442 \n", + "\n", + "[1000 rows x 4 columns]" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Work on an explicit copy so the new columns are not written to a view of random_dataset\n", + "df_number = random_dataset[['number']].copy()\n", + "df_number['number_copy_1_percent'] = df_number['number']\n", + "df_number['number_copy_5_percent'] = df_number['number']\n", + "df_number['number_copy_10_percent'] = df_number['number']\n", + "df_number" + ] + }, + { + "cell_type": "markdown", + "id": "1ff95002-46a0-454b-97c1-6c189153d459", + "metadata": {}, + "source": [ + "#### Check % missing values in this dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "35c38775-26d9-4b1e-97a9-4c46c0d5d92b", + "metadata": {}, + "outputs": [], + "source": [ + "def get_percent_missing(dataframe):\n", + "    \"\"\"Return the percentage of missing values in each column.\"\"\"\n", + "    percent_missing = dataframe.isnull().sum() * 100 / len(dataframe)\n", + "    missing_value_df = pd.DataFrame({'column_name': dataframe.columns,\n", + "                                     'percent_missing': percent_missing})\n", + "    return missing_value_df" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "6837b7e5-4444-4914-9c0e-a9cefd2c7b6f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + 
"number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 0.0\n", + "number_copy_5_percent number_copy_5_percent 0.0\n", + "number_copy_10_percent number_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_number))" + ] + }, + { + "cell_type": "markdown", + "id": "25318ebf-b1bf-4f4b-ba1d-011b27a27f39", + "metadata": {}, + "source": [ + "#### Create missing helper fn" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "76da9076-d9c8-417e-bcfc-8ce7066d1a53", + "metadata": {}, + "outputs": [], + "source": [ + "def create_missing(dataframe, percent, col):\n", + " dataframe.loc[dataframe.sample(frac = percent).index, col] = np.nan" + ] + }, + { + "cell_type": "markdown", + "id": "9dc43e57-be39-4efe-8131-d6a3423b8d77", + "metadata": {}, + "source": [ + "#### Create missing data in each col" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "6e8ab693-6043-4ade-b62a-9b3fc9ebf735", + "metadata": {}, + "outputs": [], + "source": [ + "create_missing(df_number, 0.01, 'number_copy_1_percent')\n", + "create_missing(df_number, 0.05, 'number_copy_5_percent')\n", + "create_missing(df_number, 0.1, 'number_copy_10_percent')" + ] + }, + { + "cell_type": "markdown", + "id": "655cb92a-6b63-4498-9c31-d63f11145569", + "metadata": {}, + "source": [ + "#### Check % missing after removing data" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "412518b5-67ec-4a5a-9720-4a0ce7657d44", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 1.0\n", + "number_copy_5_percent number_copy_5_percent 5.0\n", + "number_copy_10_percent number_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_number))" + ] + }, + { + "cell_type": "markdown", + "id": "6876e3fc-b878-4560-a3a4-72c36f2a422e", + "metadata": {}, + "source": [ + "#### 
Store the indices of missing rows" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "c1860270-add6-4963-9aef-27ef1e171fca", + "metadata": {}, + "outputs": [], + "source": [ + "# Store the positional indices of the NaN values in each column\n", + "number_1_idx = list(np.where(df_number['number_copy_1_percent'].isna())[0])\n", + "number_5_idx = list(np.where(df_number['number_copy_5_percent'].isna())[0])\n", + "number_10_idx = list(np.where(df_number['number_copy_10_percent'].isna())[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "57841da6-b453-40cc-8ecc-702fe4613a74", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Length of number_1_idx is 10 and it contains 1.0% of total data in column | Total rows: 1000\n", + "Length of number_5_idx is 50 and it contains 5.0% of total data in column | Total rows: 1000\n", + "Length of number_10_idx is 100 and it contains 10.0% of total data in column | Total rows: 1000\n" + ] + } + ], + "source": [ + "print(f\"Length of number_1_idx is {len(number_1_idx)} and it contains {(len(number_1_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")\n", + "print(f\"Length of number_5_idx is {len(number_5_idx)} and it contains {(len(number_5_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")\n", + "print(f\"Length of number_10_idx is {len(number_10_idx)} and it contains {(len(number_10_idx)/len(df_number['number_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_number['number_copy_1_percent'])}\")" + ] + }, + { + "cell_type": "markdown", + "id": "93450753-9080-4b17-b785-76acd5f9e19f", + "metadata": {}, + "source": [ + "## What is KNN imputation?\n", + "KNN imputation identifies the nearest neighbors of each incomplete observation through a distance measure and estimates every missing value from the observed values of those neighbors. In scikit-learn's `KNNImputer`, the default estimate is the mean of the `n_neighbors` nearest rows." + ] + }, + { + "cell_type": "markdown", + "id": "47469d0b-a8f3-4469-b18c-3a457f7dc373", + "metadata": {}, + "source": [ + "### Apply KNN imputation to the df_number dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "b09c6c85-4ce3-4aeb-bb81-6a698494a58e", + "metadata": {}, + "outputs": [], + "source": [ + "# df_number1 still holds the complete 'number' column, so KNNImputer finds neighbors using the true values\n", + "df_number1 = df_number.copy(deep=True)\n", + "imputer = KNNImputer(n_neighbors=5)\n", + "imputed_number_df = pd.DataFrame(imputer.fit_transform(df_number1), columns = df_number1.columns)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "2f051a7d-3ebd-4839-aae0-ef125944d613", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>number</th><th>number_copy_1_percent</th><th>number_copy_5_percent</th><th>number_copy_10_percent</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>701</th><td>0.244629</td><td>0.244629</td><td>0.244629</td><td>0.244629</td></tr>\n",
+ "    <tr><th>39</th><td>0.517202</td><td>0.517202</td><td>0.517202</td><td>0.517202</td></tr>\n",
+ "    <tr><th>335</th><td>0.100813</td><td>0.100813</td><td>0.100813</td><td>0.100813</td></tr>\n",
+ "    <tr><th>204</th><td>0.277534</td><td>0.277534</td><td>0.277534</td><td>0.277534</td></tr>\n",
+ "    <tr><th>391</th><td>0.859032</td><td>0.859032</td><td>0.857231</td><td>0.859032</td></tr>\n",
+ "    <tr><th>203</th><td>0.252622</td><td>0.252622</td><td>0.252622</td><td>0.252622</td></tr>\n",
+ "    <tr><th>144</th><td>0.844587</td><td>0.844587</td><td>0.844587</td><td>0.844587</td></tr>\n",
+ "    <tr><th>201</th><td>0.431603</td><td>0.431603</td><td>0.431603</td><td>0.431603</td></tr>\n",
+ "    <tr><th>749</th><td>0.848537</td><td>0.848537</td><td>0.848537</td><td>0.848240</td></tr>\n",
+ "    <tr><th>497</th><td>0.464531</td><td>0.464531</td><td>0.464531</td><td>0.464531</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " number number_copy_1_percent number_copy_5_percent \\\n", + "701 0.244629 0.244629 0.244629 \n", + "39 0.517202 0.517202 0.517202 \n", + "335 0.100813 0.100813 0.100813 \n", + "204 0.277534 0.277534 0.277534 \n", + "391 0.859032 0.859032 0.857231 \n", + "203 0.252622 0.252622 0.252622 \n", + "144 0.844587 0.844587 0.844587 \n", + "201 0.431603 0.431603 0.431603 \n", + "749 0.848537 0.848537 0.848537 \n", + "497 0.464531 0.464531 0.464531 \n", + "\n", + " number_copy_10_percent \n", + "701 0.244629 \n", + "39 0.517202 \n", + "335 0.100813 \n", + "204 0.277534 \n", + "391 0.859032 \n", + "203 0.252622 \n", + "144 0.844587 \n", + "201 0.431603 \n", + "749 0.848240 \n", + "497 0.464531 " + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "imputed_number_df.sample(10)" + ] + }, + { + "cell_type": "markdown", + "id": "ddc79a45-bd2b-44f3-a3c4-aaefa73b43d9", + "metadata": {}, + "source": [ + "#### Check the % missing data in dataframe now" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "5c98d450-bf5a-46e5-9091-c6a1202a2611", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 0.0\n", + "number_copy_5_percent number_copy_5_percent 0.0\n", + "number_copy_10_percent number_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(imputed_number_df))" + ] + }, + { + "cell_type": "markdown", + "id": "f14476bf-29e6-4d9a-9cd4-9dd56a53b466", + "metadata": {}, + "source": [ + "#### Store the list of differences between org. 
and imputed value" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "3f096800-dc6e-4455-a9e6-2db18884e5ee", + "metadata": {}, + "outputs": [], + "source": [ + "# create lists of absolute differences between imputed and original values\n", + "\n", + "number_diff_1 = []\n", + "number_diff_5 = []\n", + "number_diff_10 = []\n", + "\n", + "for i in number_1_idx:\n", + "    diff1 = abs(imputed_number_df['number_copy_1_percent'][i] - df_number1['number'][i])\n", + "    number_diff_1.append(diff1)\n", + "\n", + "for i in number_5_idx:\n", + "    diff5 = abs(imputed_number_df['number_copy_5_percent'][i] - df_number1['number'][i])\n", + "    number_diff_5.append(diff5)\n", + "\n", + "for i in number_10_idx:\n", + "    diff10 = abs(imputed_number_df['number_copy_10_percent'][i] - df_number1['number'][i])\n", + "    number_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "4a2c29fc-99f3-4624-808e-437d3983cabb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(number_diff_1))\n", + "print(len(number_diff_5))\n", + "print(len(number_diff_10))" + ] + }, + { + "cell_type": "markdown", + "id": "4ec4adbe-5571-40e3-90ba-92cb431161ca", + "metadata": {}, + "source": [ + "### Calculate the mean and variance of the list of differences (KNN)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "1163cb62-9dc4-427e-b5cf-20bf3e16d79b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 0.0005846547543839273 and variance 1% is 2.970798404420463e-07\n", + "The mean of 5% is 0.000757031064033434 and variance 5% is 4.329913201182178e-07\n", + "The mean of 10% is 0.000757031064033434 and variance 10% is 4.0351965946805086e-07\n" + ] + } + ], + "source": [ + "m1 = sum(number_diff_1) / len(number_diff_1)\n", + "\n", + "# population variance, computed with a generator expression\n", + "var_res1 = sum((xi - m1) ** 2 for xi in number_diff_1) / len(number_diff_1)\n", + "\n", + "m5 = sum(number_diff_5) / len(number_diff_5)\n", + "\n", + "# population variance, computed with a generator expression\n", + "var_res5 = sum((xii - m5) ** 2 for xii in number_diff_5) / len(number_diff_5)\n", + "\n", + "\n", + "m10 = sum(number_diff_10) / len(number_diff_10)\n", + "\n", + "# population variance, computed with a generator expression\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in number_diff_10) / len(number_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "6987d059-7449-44a0-a3c2-8605362a18a0", + "metadata": {}, + "outputs": [], + "source": [ + "df_knn_number = pd.DataFrame.from_dict({'1%_number': [m1, var_res1],\n", + " '5%_number': [m5, var_res5],\n", + " '10%_number': [m10, var_res10]}, orient='index')\n", + "df_knn_number.columns=['diff. list Mean(KNN)', 'diff. list Var.(KNN)']" + ] + }, + { + "cell_type": "markdown", + "id": "8d1efbf1-61d6-43e1-9a4a-4af137e081c9", + "metadata": {}, + "source": [ + "## What is Mean imputation?\n", + "Mean imputation (MI) is a method in which the mean of the observed values for each variable is computed and the missing values for that variable are imputed by this mean." 
+ ] + }, + { + "cell_type": "markdown", + "id": "41740e20-5dae-403e-a83b-94c91469fcc3", + "metadata": {}, + "source": [ + "### Perform MEAN based imputation" + ] + }, + { + "cell_type": "markdown", + "id": "17b69478-e97c-41b9-828a-eefbb46eb161", + "metadata": {}, + "source": [ + "#### Before mean imputation % missing" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "5a828216-8f1a-4157-8141-77e6c929f57a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 1.0\n", + "number_copy_5_percent number_copy_5_percent 5.0\n", + "number_copy_10_percent number_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "df_number2 = df_number.copy(deep=True)\n", + "print(get_percent_missing(df_number2))" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "1e137676-9f01-44b9-8a84-50d03a89436b", + "metadata": {}, + "outputs": [], + "source": [ + "df_number2['number_copy_1_percent'] = df_number2['number_copy_1_percent'].fillna(df_number2['number_copy_1_percent'].mean())\n", + "df_number2['number_copy_5_percent'] = df_number2['number_copy_5_percent'].fillna(df_number2['number_copy_5_percent'].mean())\n", + "df_number2['number_copy_10_percent'] = df_number2['number_copy_10_percent'].fillna(df_number2['number_copy_10_percent'].mean())" + ] + }, + { + "cell_type": "markdown", + "id": "8da82021-d96a-46ac-81df-035977cb5497", + "metadata": {}, + "source": [ + "#### After mean impute % missing " + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "669c14bd-f920-47db-8476-1cd1b4f4f5bb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "number number 0.0\n", + "number_copy_1_percent number_copy_1_percent 0.0\n", + "number_copy_5_percent number_copy_5_percent 0.0\n", + "number_copy_10_percent 
number_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_number2))" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "ccb60d18-b24e-4211-9947-46ee0bcc06fe", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>number</th><th>number_copy_1_percent</th><th>number_copy_5_percent</th><th>number_copy_10_percent</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>293</th><td>0.583231</td><td>0.583231</td><td>0.583231</td><td>0.583231</td></tr>\n",
+ "    <tr><th>461</th><td>0.867035</td><td>0.867035</td><td>0.867035</td><td>0.867035</td></tr>\n",
+ "    <tr><th>875</th><td>0.676228</td><td>0.676228</td><td>0.676228</td><td>0.676228</td></tr>\n",
+ "    <tr><th>999</th><td>0.771442</td><td>0.771442</td><td>0.771442</td><td>0.771442</td></tr>\n",
+ "    <tr><th>75</th><td>0.909050</td><td>0.909050</td><td>0.909050</td><td>0.909050</td></tr>\n",
+ "    <tr><th>98</th><td>0.629583</td><td>0.629583</td><td>0.629583</td><td>0.629583</td></tr>\n",
+ "    <tr><th>381</th><td>0.181614</td><td>0.181614</td><td>0.181614</td><td>0.181614</td></tr>\n",
+ "    <tr><th>592</th><td>0.523109</td><td>0.523109</td><td>0.523109</td><td>0.523109</td></tr>\n",
+ "    <tr><th>155</th><td>0.038074</td><td>0.038074</td><td>0.038074</td><td>0.038074</td></tr>\n",
+ "    <tr><th>630</th><td>0.869200</td><td>0.869200</td><td>0.869200</td><td>0.869200</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>" + ], + "text/plain": [ + " number number_copy_1_percent number_copy_5_percent \\\n", + "293 0.583231 0.583231 0.583231 \n", + "461 0.867035 0.867035 0.867035 \n", + "875 0.676228 0.676228 0.676228 \n", + "999 0.771442 0.771442 0.771442 \n", + "75 0.909050 0.909050 0.909050 \n", + "98 0.629583 0.629583 0.629583 \n", + "381 0.181614 0.181614 0.181614 \n", + "592 0.523109 0.523109 0.523109 \n", + "155 0.038074 0.038074 0.038074 \n", + "630 0.869200 0.869200 0.869200 \n", + "\n", + " number_copy_10_percent \n", + "293 0.583231 \n", + "461 0.867035 \n", + "875 0.676228 \n", + "999 0.771442 \n", + "75 0.909050 \n", + "98 0.629583 \n", + "381 0.181614 \n", + "592 0.523109 \n", + "155 0.038074 \n", + "630 0.869200 " + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_number2.sample(10)" + ] + }, + { + "cell_type": "markdown", + "id": "88d89795-0ae9-4f37-89cd-b24d36658588", + "metadata": {}, + "source": [ + "#### Create the list of differences (mean imputation)" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "530979d5-52c4-473d-95f3-754c460a7ab6", + "metadata": {}, + "outputs": [], + "source": [ + "# create lists of absolute differences between imputed and original values\n", + "\n", + "number_diff_1_mean = []\n", + "number_diff_5_mean = []\n", + "number_diff_10_mean = []\n", + "\n", + "for i in number_1_idx:\n", + "    diff1 = abs(df_number2['number_copy_1_percent'][i] - df_number2['number'][i])\n", + "    number_diff_1_mean.append(diff1)\n", + "\n", + "for i in number_5_idx:\n", + "    diff5 = abs(df_number2['number_copy_5_percent'][i] - df_number2['number'][i])\n", + "    number_diff_5_mean.append(diff5)\n", + "\n", + "for i in number_10_idx:\n", + "    diff10 = abs(df_number2['number_copy_10_percent'][i] - df_number2['number'][i])\n", + "    number_diff_10_mean.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "28dd2494-0175-431e-b4b7-09ee4af1f6a0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(number_diff_1_mean))\n", + "print(len(number_diff_5_mean))\n", + "print(len(number_diff_10_mean))" + ] + }, + { + "cell_type": "markdown", + "id": "4e90251e-4c0a-4e2d-82b1-8764374aed1c", + "metadata": {}, + "source": [ + "### Calculate the mean and variance of the list of differences (mean imputation)" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "682bd76e-4875-4b4d-b90b-91d8a6e492ae", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 0.29595595666774266 and variance 1% is 0.02234691636534702\n", + "The mean of 5% is 0.2606794287327926 and variance 5% is 0.017948559982927326\n", + "The mean of 10% is 0.2606794287327926 and variance 10% is 0.019225304317791198\n" + ] + } + ], + "source": [ + "m1 = sum(number_diff_1_mean) / len(number_diff_1_mean)\n", + "\n", + "# population variance, computed with a generator expression\n", + "var_res1 = sum((xi - m1) ** 2 for xi in number_diff_1_mean) / len(number_diff_1_mean)\n", + "\n", + "m5 = sum(number_diff_5_mean) / len(number_diff_5_mean)\n", + "\n", + "# population variance, computed with a generator expression\n", + "var_res5 = sum((xii - m5) ** 2 for xii in number_diff_5_mean) / len(number_diff_5_mean)\n", + "\n", + "\n", + "m10 = sum(number_diff_10_mean) / len(number_diff_10_mean)\n", + "\n", + "# population variance, computed with a generator expression\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in number_diff_10_mean) / len(number_diff_10_mean)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "1f41880d-3e7d-48c9-8744-7e47ccae3c17", + "metadata": {}, + "outputs": 
[], + "source": [ + "df_MI_number = pd.DataFrame.from_dict({'1%_number': [m1, var_res1],\n", + " '5%_number': [m5, var_res5],\n", + " '10%_number': [m10, var_res10]}, orient='index')\n", + "df_MI_number.columns=['diff. list Mean(MI)', 'diff. list Var.(MI)']" + ] + }, + { + "cell_type": "markdown", + "id": "ec64b079-db97-429c-ae3a-519eec91db3f", + "metadata": {}, + "source": [ + "## KNN and MEAN columns side by side" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "d74b0e73-e3f0-4107-806d-c5d5a50aab9a", + "metadata": {}, + "outputs": [], + "source": [ + "from IPython.display import display_html\n", + "from itertools import chain,cycle\n", + "def display_side_by_side(*args,titles=cycle([''])):\n", + "    # Render each dataframe as an inline HTML table so they appear next to each other\n", + "    html_str=''\n", + "    for df,title in zip(args, chain(titles,cycle(['</br>'])) ):\n", + "        html_str+='<th style=\"text-align:center\"><td style=\"vertical-align:top\">'\n", + "        html_str+=f'<h2 style=\"text-align: center;\">{title}</h2>'\n", + "        html_str+=df.to_html().replace('table','table style=\"display:inline\"')\n", + "        html_str+='</td></th>'\n", + "    display_html(html_str,raw=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "747a487f-cbc4-467a-9bc7-b0856dbb6576", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<style>\n", + ".output {\n", + "    flex-direction: row;\n", + "}\n", + "</style>" + ], + "text/plain": [ + "<IPython.core.display.HTML object>" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from IPython.display import display, HTML\n", + "\n", + "CSS = \"\"\"\n", + ".output {\n", + "    flex-direction: row;\n", + "}\n", + "\"\"\"\n", + "\n", + "HTML('<style>{}</style>'.format(CSS))" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "d24551d1-cd58-4a41-8262-873fe5034272", + "metadata": {}, + "outputs": [], + "source": [ + "# https://github.com/epmoyer/ipy_table/issues/24\n", + "\n", + "from IPython.display import HTML\n", + "\n", + "def multi_table(table_list):\n", + "    ''' Accepts a list of objects with a _repr_html_() method (IpyTable objects or DataFrames)\n", + "        and returns an HTML table that shows each one in its own cell\n", + "    '''\n", + "    return HTML(\n", + "        '<table><tr style=\"background-color:white;\">' + \n", + "        ''.join(['<td>' + table._repr_html_() + '</td>' for table in table_list]) +\n", + "        '</tr></table>'\n", + "    )" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "8a8daa30-3abf-4315-ae58-f9171ff000d5", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[103, 272, 302, 441, 542]\n" + ] + } + ], + "source": [ + "print(number_1_idx[:5])" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "da6b1646-2417-42b7-bc8f-d3b0be85c61b", + "metadata": {}, + "outputs": [], + "source": [ + "compare_1 = imputed_number_df.loc[:, [\"number\", \"number_copy_1_percent\"]]\n", + "compare_5 = imputed_number_df.loc[:, [\"number\", \"number_copy_5_percent\"]]\n", + "compare_10 = imputed_number_df.loc[:, [\"number\", \"number_copy_10_percent\"]]" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "380b94cf-264f-4a41-bb1d-ac272354073f", + "metadata": {}, + "outputs": [], + "source": [ + "compare_1_df = compare_1.iloc[number_1_idx]\n", + "compare_5_df = compare_5.iloc[number_5_idx]\n", + "compare_10_df = compare_10.iloc[number_10_idx]" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "e5b21e71-0ddd-4c60-b931-b384d65230dd", + "metadata": {}, + "outputs": [], + "source": [ + "compare_1_mean = df_number2.loc[:, [\"number\", \"number_copy_1_percent\"]]\n", + "compare_5_mean = df_number2.loc[:, [\"number\", \"number_copy_5_percent\"]]\n", + "compare_10_mean = df_number2.loc[:, [\"number\", \"number_copy_10_percent\"]]" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "29be3554-8129-4f0c-bad6-1270b7c6c05b", + "metadata": {}, + "outputs": [], + "source": [ + "compare_1_mean_df = compare_1_mean.iloc[number_1_idx]\n", + "compare_5_mean_df = compare_5_mean.iloc[number_5_idx]\n", + "compare_10_mean_df = compare_10_mean.iloc[number_10_idx]" + ] + }, + { + "cell_type": "markdown", + "id": "72a3bc3c-0f91-49ad-bf03-dc4b7ace265d", + "metadata": {}, + "source": [ + "#### **number 1% KNN Impute VS number 1% Mean Impute**" + ] + }, + { + "cell_type": "code", + 
"execution_count": 39, + "id": "6fd11f89-9f4b-49b3-b114-1ab3b461f180", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<table><tr style=\"background-color:white;\"><td><div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>number</th><th>number_copy_1_percent</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>103</th><td>0.915554</td><td>0.915539</td></tr>\n",
+ "    <tr><th>272</th><td>0.899497</td><td>0.899795</td></tr>\n",
+ "    <tr><th>302</th><td>0.091276</td><td>0.090500</td></tr>\n",
+ "    <tr><th>441</th><td>0.050874</td><td>0.050914</td></tr>\n",
+ "    <tr><th>542</th><td>0.744744</td><td>0.744208</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div></td><td><div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>number</th><th>number_copy_1_percent</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>103</th><td>0.915554</td><td>0.493992</td></tr>\n",
+ "    <tr><th>272</th><td>0.899497</td><td>0.493992</td></tr>\n",
+ "    <tr><th>302</th><td>0.091276</td><td>0.493992</td></tr>\n",
+ "    <tr><th>441</th><td>0.050874</td><td>0.493992</td></tr>\n",
+ "    <tr><th>542</th><td>0.744744</td><td>0.493992</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div></td></tr></table>" + ], + "text/plain": [ + "<IPython.core.display.HTML object>" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "multi_table([compare_1_df.head(), compare_1_mean_df.head()])" + ] + }, + { + "cell_type": "markdown", + "id": "e1fc9d1c-53ef-42d3-809b-d68051057e48", + "metadata": {}, + "source": [ + "#### **number 5% KNN Impute VS number 5% Mean Impute**" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "a97c1530-2e50-48d2-a7e0-89fc70f648e5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
      number  number_copy_5_percent
 6  0.040451               0.039472
14  0.852026               0.849692
16  0.213343               0.212438
49  0.608203               0.609078
64  0.973574               0.972234

      number  number_copy_5_percent
 6  0.040451                0.49266
14  0.852026                0.49266
16  0.213343                0.49266
49  0.608203                0.49266
64  0.973574                0.49266
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "multi_table([compare_5_df.head(), compare_5_mean_df.head()])" + ] + }, + { + "cell_type": "markdown", + "id": "1e732ac9-faf7-4457-baef-ac9c4976598c", + "metadata": {}, + "source": [ + "#### **number 10% KNN Impute VS number 10% Mean Impute**" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "f2d22e8f-5a0b-48c0-9150-a391d48e93b2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
      number  number_copy_10_percent
10  0.205755                0.206019
16  0.213343                0.212724
27  0.738704                0.737446
29  0.322577                0.322495
43  0.403866                0.404988

      number  number_copy_10_percent
10  0.205755                 0.50025
16  0.213343                 0.50025
27  0.738704                 0.50025
29  0.322577                 0.50025
43  0.403866                 0.50025
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "multi_table([compare_10_df.head(), compare_10_mean_df.head()])" + ] + }, + { + "cell_type": "markdown", + "id": "cc817314-971f-4abf-a56e-9830a5cf0329", + "metadata": {}, + "source": [ + "# 1.2 Random Numbers dataset Results - KNN and MEAN" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "c4ebb2fe-34e9-4bd2-bf53-9392e5d05e52", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
            diff. list Mean(KNN)  diff. list Var.(KNN)
1%_number               0.000585          2.970798e-07
5%_number               0.000757          4.329913e-07
10%_number              0.000661          4.035197e-07

            diff. list Mean(MI)  diff. list Var.(MI)
1%_number              0.295956             0.022347
5%_number              0.260679             0.017949
10%_number             0.242477             0.019225
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "multi_table([df_knn_number, df_MI_number])" + ] + }, + { + "cell_type": "markdown", + "id": "177bab9a-d501-479d-bbe8-d0c93926a24d", + "metadata": {}, + "source": [ + "Results : We can see here that KNN performed much better than the mean imputation since KNN will use the method of finding the nearest neighbour. The error in the actual and the imputed value is almost close to zero which signifies that this method is actually predicting and imputing correct values." + ] + }, + { + "cell_type": "markdown", + "id": "08586561-e3a5-4d15-a1c0-b8d71731a84a", + "metadata": {}, + "source": [ + "# 2.1 Housing Dataset " + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "c05f4dd5-4cdc-4617-939a-2e22ec859af1", + "metadata": {}, + "outputs": [], + "source": [ + "housing_data = pd.read_csv('https://raw.githubusercontent.com/SheshNGupta/datasets/main/train.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "8564d163-97ce-44da-8d3c-6f8cd9c1d0a1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
        Id  MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  LandContour  Utilities  ...  PoolArea  PoolQC  Fence  MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition  SalePrice
 740   741          70        RM         60.0     9600    Pave   Grvl       Reg          Lvl     AllPub  ...         0     NaN  GdPrv          NaN        0       5    2007        WD        Abnorml     132000
1209  1210          20        RL         85.0    10182    Pave    NaN       IR1          Lvl     AllPub  ...         0     NaN    NaN          NaN        0       5    2006       New        Partial     290000
  64    65          60        RL          NaN     9375    Pave    NaN       Reg          Lvl     AllPub  ...         0     NaN  GdPrv          NaN        0       2    2009        WD         Normal     219500
 208   209          60        RL          NaN    14364    Pave    NaN       IR1          Low     AllPub  ...         0     NaN    NaN          NaN        0       4    2007        WD         Normal     277000
 436   437          50        RM         40.0     4400    Pave    NaN       Reg          Lvl     AllPub  ...         0     NaN    NaN          NaN        0      10    2006        WD         Normal     116000
  19    20          20        RL         70.0     7560    Pave    NaN       Reg          Lvl     AllPub  ...         0     NaN  MnPrv          NaN        0       5    2009       COD        Abnorml     139000
1449  1450         180        RM         21.0     1533    Pave    NaN       Reg          Lvl     AllPub  ...         0     NaN    NaN          NaN        0       8    2006        WD        Abnorml      92000
 449   450          50        RM         50.0     6000    Pave    NaN       Reg          Lvl     AllPub  ...         0     NaN    NaN          NaN        0       6    2007        WD         Normal     120000
1185  1186          50        RL         60.0     9738    Pave    NaN       Reg          Lvl     AllPub  ...         0     NaN    NaN          NaN        0       3    2006        WD         Normal     104900
1023  1024         120        RL         43.0     3182    Pave    NaN       Reg          Lvl     AllPub  ...         0     NaN    NaN          NaN        0       5    2008        WD         Normal     191000

10 rows × 81 columns
" + ], + "text/plain": [ + " Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", + "740 741 70 RM 60.0 9600 Pave Grvl Reg \n", + "1209 1210 20 RL 85.0 10182 Pave NaN IR1 \n", + "64 65 60 RL NaN 9375 Pave NaN Reg \n", + "208 209 60 RL NaN 14364 Pave NaN IR1 \n", + "436 437 50 RM 40.0 4400 Pave NaN Reg \n", + "19 20 20 RL 70.0 7560 Pave NaN Reg \n", + "1449 1450 180 RM 21.0 1533 Pave NaN Reg \n", + "449 450 50 RM 50.0 6000 Pave NaN Reg \n", + "1185 1186 50 RL 60.0 9738 Pave NaN Reg \n", + "1023 1024 120 RL 43.0 3182 Pave NaN Reg \n", + "\n", + " LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal \\\n", + "740 Lvl AllPub ... 0 NaN GdPrv NaN 0 \n", + "1209 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "64 Lvl AllPub ... 0 NaN GdPrv NaN 0 \n", + "208 Low AllPub ... 0 NaN NaN NaN 0 \n", + "436 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "19 Lvl AllPub ... 0 NaN MnPrv NaN 0 \n", + "1449 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "449 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "1185 Lvl AllPub ... 0 NaN NaN NaN 0 \n", + "1023 Lvl AllPub ... 
0 NaN NaN NaN 0 \n", + "\n", + " MoSold YrSold SaleType SaleCondition SalePrice \n", + "740 5 2007 WD Abnorml 132000 \n", + "1209 5 2006 New Partial 290000 \n", + "64 2 2009 WD Normal 219500 \n", + "208 4 2007 WD Normal 277000 \n", + "436 10 2006 WD Normal 116000 \n", + "19 5 2009 COD Abnorml 139000 \n", + "1449 8 2006 WD Abnorml 92000 \n", + "449 6 2007 WD Normal 120000 \n", + "1185 3 2006 WD Normal 104900 \n", + "1023 5 2008 WD Normal 191000 \n", + "\n", + "[10 rows x 81 columns]" + ] + }, + "execution_count": 52, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data.sample(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "bd81975c-0a21-414b-8e20-3564d35b9f9b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "663" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data['SalePrice'].nunique()" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "67d1046e-a1ad-412e-a7e8-a0d51729cec7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1073" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data['LotArea'].nunique()" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "id": "64b05e52-72dc-4f7d-aca3-d043036b4d2f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "count 1460.000000\n", + "mean 180921.195890\n", + "std 79442.502883\n", + "min 34900.000000\n", + "25% 129975.000000\n", + "50% 163000.000000\n", + "75% 214000.000000\n", + "max 755000.000000\n", + "Name: SalePrice, dtype: float64" + ] + }, + "execution_count": 55, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data['SalePrice'].describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "id": "b7e9928c-4785-4ee1-8150-cd0fa1ef3325", + "metadata": {}, + "outputs": [ + { + 
"data": { + "text/plain": [ + "count 1460.000000\n", + "mean 10516.828082\n", + "std 9981.264932\n", + "min 1300.000000\n", + "25% 7553.500000\n", + "50% 9478.500000\n", + "75% 11601.500000\n", + "max 215245.000000\n", + "Name: LotArea, dtype: float64" + ] + }, + "execution_count": 56, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "housing_data['LotArea'].describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "id": "20149f80-07dc-4eaa-8d0e-7de6612a7dce", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "Id Id 0.000000\n", + "MSSubClass MSSubClass 0.000000\n", + "MSZoning MSZoning 0.000000\n", + "LotFrontage LotFrontage 17.739726\n", + "LotArea LotArea 0.000000\n", + "Street Street 0.000000\n", + "Alley Alley 93.767123\n", + "LotShape LotShape 0.000000\n", + "LandContour LandContour 0.000000\n", + "Utilities Utilities 0.000000\n", + "LotConfig LotConfig 0.000000\n", + "LandSlope LandSlope 0.000000\n", + "Neighborhood Neighborhood 0.000000\n", + "Condition1 Condition1 0.000000\n", + "Condition2 Condition2 0.000000\n", + "BldgType BldgType 0.000000\n", + "HouseStyle HouseStyle 0.000000\n", + "OverallQual OverallQual 0.000000\n", + "OverallCond OverallCond 0.000000\n", + "YearBuilt YearBuilt 0.000000\n", + "YearRemodAdd YearRemodAdd 0.000000\n", + "RoofStyle RoofStyle 0.000000\n", + "RoofMatl RoofMatl 0.000000\n", + "Exterior1st Exterior1st 0.000000\n", + "Exterior2nd Exterior2nd 0.000000\n", + "MasVnrType MasVnrType 0.547945\n", + "MasVnrArea MasVnrArea 0.547945\n", + "ExterQual ExterQual 0.000000\n", + "ExterCond ExterCond 0.000000\n", + "Foundation Foundation 0.000000\n", + "BsmtQual BsmtQual 2.534247\n", + "BsmtCond BsmtCond 2.534247\n", + "BsmtExposure BsmtExposure 2.602740\n", + "BsmtFinType1 BsmtFinType1 2.534247\n", + "BsmtFinSF1 BsmtFinSF1 0.000000\n", + "BsmtFinType2 BsmtFinType2 2.602740\n", + "BsmtFinSF2 BsmtFinSF2 
0.000000\n", + "BsmtUnfSF BsmtUnfSF 0.000000\n", + "TotalBsmtSF TotalBsmtSF 0.000000\n", + "Heating Heating 0.000000\n", + "HeatingQC HeatingQC 0.000000\n", + "CentralAir CentralAir 0.000000\n", + "Electrical Electrical 0.068493\n", + "1stFlrSF 1stFlrSF 0.000000\n", + "2ndFlrSF 2ndFlrSF 0.000000\n", + "LowQualFinSF LowQualFinSF 0.000000\n", + "GrLivArea GrLivArea 0.000000\n", + "BsmtFullBath BsmtFullBath 0.000000\n", + "BsmtHalfBath BsmtHalfBath 0.000000\n", + "FullBath FullBath 0.000000\n", + "HalfBath HalfBath 0.000000\n", + "BedroomAbvGr BedroomAbvGr 0.000000\n", + "KitchenAbvGr KitchenAbvGr 0.000000\n", + "KitchenQual KitchenQual 0.000000\n", + "TotRmsAbvGrd TotRmsAbvGrd 0.000000\n", + "Functional Functional 0.000000\n", + "Fireplaces Fireplaces 0.000000\n", + "FireplaceQu FireplaceQu 47.260274\n", + "GarageType GarageType 5.547945\n", + "GarageYrBlt GarageYrBlt 5.547945\n", + "GarageFinish GarageFinish 5.547945\n", + "GarageCars GarageCars 0.000000\n", + "GarageArea GarageArea 0.000000\n", + "GarageQual GarageQual 5.547945\n", + "GarageCond GarageCond 5.547945\n", + "PavedDrive PavedDrive 0.000000\n", + "WoodDeckSF WoodDeckSF 0.000000\n", + "OpenPorchSF OpenPorchSF 0.000000\n", + "EnclosedPorch EnclosedPorch 0.000000\n", + "3SsnPorch 3SsnPorch 0.000000\n", + "ScreenPorch ScreenPorch 0.000000\n", + "PoolArea PoolArea 0.000000\n", + "PoolQC PoolQC 99.520548\n", + "Fence Fence 80.753425\n", + "MiscFeature MiscFeature 96.301370\n", + "MiscVal MiscVal 0.000000\n", + "MoSold MoSold 0.000000\n", + "YrSold YrSold 0.000000\n", + "SaleType SaleType 0.000000\n", + "SaleCondition SaleCondition 0.000000\n", + "SalePrice SalePrice 0.000000\n" + ] + } + ], + "source": [ + "pd.set_option('display.max_rows', None)\n", + "print(get_percent_missing(housing_data))" + ] + }, + { + "cell_type": "markdown", + "id": "c8eb3ee3-085d-4b41-9a5f-c83a3805f870", + "metadata": {}, + "source": [ + "#### Using the SalePrice column for the KNN and MEAN imputation task" + ] + }, + { + "cell_type": 
"markdown", + "id": "451c79fb-17ba-40ac-8f0b-87a8b2ec4837", + "metadata": {}, + "source": [ + "#### Non Scaled dataframe Sale Price - take first 1000 rows" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "id": "9cc1f97f-1b24-4570-8f6a-30426bd79269", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
   SalePrice  sp_copy_1_percent  sp_copy_5_percent  sp_copy_10_percent
0     208500             208500             208500              208500
1     181500             181500             181500              181500
2     223500             223500             223500              223500
3     140000             140000             140000              140000
4     250000             250000             250000              250000
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "0 208500 208500 208500 208500\n", + "1 181500 181500 181500 181500\n", + "2 223500 223500 223500 223500\n", + "3 140000 140000 140000 140000\n", + "4 250000 250000 250000 250000" + ] + }, + "execution_count": 58, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_saleprice = housing_data[['SalePrice']][:1000]\n", + "df_saleprice['sp_copy_1_percent'] = df_saleprice[['SalePrice']]\n", + "df_saleprice['sp_copy_5_percent'] = df_saleprice[['SalePrice']]\n", + "df_saleprice['sp_copy_10_percent'] = df_saleprice[['SalePrice']]\n", + "df_saleprice.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "id": "f462f065-9f37-44f1-a22e-92e610dae2e9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1000" + ] + }, + "execution_count": 59, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(df_saleprice)" + ] + }, + { + "cell_type": "markdown", + "id": "03407bbd-f8a7-4f6c-a7c3-64a865ed3f7e", + "metadata": {}, + "source": [ + "#### Scaled Dataframe SalePrice - take first 1000 rows" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "e461b1ef-df2c-410f-aea8-abe954fa9afd", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
   SalePrice  sp_copy_1_percent  sp_copy_5_percent  sp_copy_10_percent
0   0.241078           0.241078           0.241078            0.241078
1   0.203583           0.203583           0.203583            0.203583
2   0.261908           0.261908           0.261908            0.261908
3   0.145952           0.145952           0.145952            0.145952
4   0.298709           0.298709           0.298709            0.298709
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "0 0.241078 0.241078 0.241078 0.241078\n", + "1 0.203583 0.203583 0.203583 0.203583\n", + "2 0.261908 0.261908 0.261908 0.261908\n", + "3 0.145952 0.145952 0.145952 0.145952\n", + "4 0.298709 0.298709 0.298709 0.298709" + ] + }, + "execution_count": 60, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "scaler = MinMaxScaler()\n", + "df_saleprice_scaled = df_saleprice.copy(deep=True)\n", + "df_saleprice_scaled = pd.DataFrame(scaler.fit_transform(df_saleprice_scaled), columns = df_saleprice_scaled.columns)\n", + "df_saleprice_scaled.head()" + ] + }, + { + "cell_type": "markdown", + "id": "a66683c4-f66a-4aa1-ab8a-f28087b60b6c", + "metadata": {}, + "source": [ + "#### Check % missing values in this dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "0075fa0f-4b82-4089-ab81-e5282497c4a3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice))" + ] + }, + { + "cell_type": "markdown", + "id": "619ef99f-55c0-422c-aaa8-73cd71fcf2fb", + "metadata": {}, + "source": [ + "#### Create 1%, 5% and 10% missing data" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "82df5098-4176-4fba-922f-ca84c0466f2a", + "metadata": {}, + "outputs": [], + "source": [ + "create_missing(df_saleprice, 0.01, 'sp_copy_1_percent')\n", + "create_missing(df_saleprice, 0.05, 'sp_copy_5_percent')\n", + "create_missing(df_saleprice, 0.1, 'sp_copy_10_percent')" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "id": "0e90ae04-cd10-4507-a851-c187010f0be0", + "metadata": {}, + "outputs": [], + "source": [ + 
"create_missing(df_saleprice_scaled, 0.01, 'sp_copy_1_percent')\n", + "create_missing(df_saleprice_scaled, 0.05, 'sp_copy_5_percent')\n", + "create_missing(df_saleprice_scaled, 0.1, 'sp_copy_10_percent')" + ] + }, + { + "cell_type": "markdown", + "id": "a8237a82-5a33-4ce9-b4c7-a48ede4f5fef", + "metadata": {}, + "source": [ + "#### With/Without scaling dataframe missing values check" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "id": "2794306d-89c7-4518-8979-9edb3d9441b1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 1.0\n", + "sp_copy_5_percent sp_copy_5_percent 5.0\n", + "sp_copy_10_percent sp_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice))" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "id": "8351dbe2-b388-451d-9238-52c4ccabd425", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 1.0\n", + "sp_copy_5_percent sp_copy_5_percent 5.0\n", + "sp_copy_10_percent sp_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice_scaled))" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "id": "b11b093f-110b-4ef3-9d00-ac4fed45a956", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "10" + ] + }, + "execution_count": 66, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_saleprice['sp_copy_1_percent'].isna().sum()" + ] + }, + { + "cell_type": "markdown", + "id": "360e0010-e085-435c-8902-80c6a7ea78be", + "metadata": {}, + "source": [ + "#### Store indices of missing values" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "id": "e546096c-ce35-448e-aa97-0943d3535a87", + "metadata": {}, + 
"outputs": [], + "source": [ + "# Store the indices of the NaN values in each column\n", + "sp_1_idx = list(np.where(df_saleprice['sp_copy_1_percent'].isna())[0])\n", + "sp_5_idx = list(np.where(df_saleprice['sp_copy_5_percent'].isna())[0])\n", + "sp_10_idx = list(np.where(df_saleprice['sp_copy_10_percent'].isna())[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "id": "d409e2a5-b3a9-4ae1-9b17-88b7c642692d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_1_idx))\n", + "print(len(sp_5_idx))\n", + "print(len(sp_10_idx))" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "5839460a-e736-42e9-9a13-d5bab5683115", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Length of sp_1_idx is 10 and it contains 1.0% of total data in column | Total rows: 1000\n", + "Length of sp_5_idx is 50 and it contains 5.0% of total data in column | Total rows: 1000\n", + "Length of sp_10_idx is 100 and it contains 10.0% of total data in column | Total rows: 1000\n" + ] + } + ], + "source": [ + "print(f\"Length of sp_1_idx is {len(sp_1_idx)} and it contains {(len(sp_1_idx)/len(df_saleprice['sp_copy_1_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_1_percent'])}\")\n", + "print(f\"Length of sp_5_idx is {len(sp_5_idx)} and it contains {(len(sp_5_idx)/len(df_saleprice['sp_copy_5_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_5_percent'])}\")\n", + "print(f\"Length of sp_10_idx is {len(sp_10_idx)} and it contains {(len(sp_10_idx)/len(df_saleprice['sp_copy_10_percent']))*100}% of total data in column | Total rows: {len(df_saleprice['sp_copy_10_percent'])}\")" + ] + }, + { + "cell_type": "markdown", + "id": "c1464c79-c0a9-4640-92dd-f0d5131634ab", + "metadata": {}, + "source": [ + "### Perform KNN imputation on df_saleprice and 
df_saleprice_scaled dataframe" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "id": "08fa2436-ffb8-4b5d-a7a1-9e2d63b14562", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice1 = df_saleprice.copy(deep=True)\n", + "imputer = KNNImputer(n_neighbors=5)\n", + "imputed_saleprice_df = pd.DataFrame(imputer.fit_transform(df_saleprice1), columns = df_saleprice1.columns)" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "id": "205c7a96-3f1c-42a4-91de-f22f15ce9cb2", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice_scaled1 = df_saleprice_scaled.copy(deep=True)\n", + "imputer = KNNImputer(n_neighbors=5)\n", + "imputed_saleprice_scaled_df = pd.DataFrame(imputer.fit_transform(df_saleprice_scaled1), columns = df_saleprice_scaled1.columns)" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "id": "a482f58d-73b6-423c-b97a-140884830a0f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
   SalePrice  sp_copy_1_percent  sp_copy_5_percent  sp_copy_10_percent
0   208500.0           208500.0           208500.0            208500.0
1   181500.0           181500.0           181500.0            181500.0
2   223500.0           223500.0           223500.0            223500.0
3   140000.0           140000.0           140000.0            140000.0
4   250000.0           250000.0           250000.0            250000.0
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "0 208500.0 208500.0 208500.0 208500.0\n", + "1 181500.0 181500.0 181500.0 181500.0\n", + "2 223500.0 223500.0 223500.0 223500.0\n", + "3 140000.0 140000.0 140000.0 140000.0\n", + "4 250000.0 250000.0 250000.0 250000.0" + ] + }, + "execution_count": 72, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "imputed_saleprice_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "id": "11f8f5ff-f06d-4ec2-a4e3-1324e807a537", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
   SalePrice  sp_copy_1_percent  sp_copy_5_percent  sp_copy_10_percent
0   0.241078           0.241078           0.241078            0.241078
1   0.203583           0.203583           0.203583            0.203583
2   0.261908           0.261908           0.261908            0.261908
3   0.145952           0.145952           0.145952            0.145952
4   0.298709           0.298709           0.298709            0.298709
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "0 0.241078 0.241078 0.241078 0.241078\n", + "1 0.203583 0.203583 0.203583 0.203583\n", + "2 0.261908 0.261908 0.261908 0.261908\n", + "3 0.145952 0.145952 0.145952 0.145952\n", + "4 0.298709 0.298709 0.298709 0.298709" + ] + }, + "execution_count": 73, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "imputed_saleprice_scaled_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "d9fd7fa1-4ce0-43be-9955-55ef759d930b", + "metadata": {}, + "source": [ + "#### Check % missing in saleprice and saleprice_scaled DF" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "id": "9ed0d36a-9584-4e3b-9201-2ac36827bce9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(imputed_saleprice_df))" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "id": "7c842fce-bbd5-4c2c-bb1a-db5df92f6315", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(imputed_saleprice_scaled_df))" + ] + }, + { + "cell_type": "markdown", + "id": "ac47abb1-df5f-4686-bc67-6617140c008c", + "metadata": {}, + "source": [ + "#### Store the list of disfferences between Org. 
and Imputed Value" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "id": "99e04554-568d-4efa-a110-768b50dfaee6", + "metadata": {}, + "outputs": [], + "source": [ + "# create lists of absolute differences between imputed and original values\n", + "\n", + "sp_diff_1 = []\n", + "sp_diff_5 = []\n", + "sp_diff_10 = []\n", + "count = 0\n", + "\n", + "for i in sp_1_idx:\n", + " count +=1\n", + " diff1 = abs(imputed_saleprice_df['sp_copy_1_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", + " sp_diff_1.append(diff1)\n", + " \n", + "\n", + "for i in sp_5_idx:\n", + " diff5 = abs(imputed_saleprice_df['sp_copy_5_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", + " sp_diff_5.append(diff5)\n", + "\n", + "for i in sp_10_idx:\n", + " diff10 = abs(imputed_saleprice_df['sp_copy_10_percent'][i] - imputed_saleprice_df['SalePrice'][i])\n", + " sp_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "id": "92204f8a-497c-470d-a770-59165d226cc9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_diff_1))\n", + "print(len(sp_diff_5))\n", + "print(len(sp_diff_10))" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "id": "b8875fff-0289-4dd9-92c1-78dc9b730d22", + "metadata": {}, + "outputs": [], + "source": [ + "# create lists of absolute differences between imputed and original values (scaled dataframe)\n", + "# NOTE: sp_1_idx, sp_5_idx and sp_10_idx hold the NaN positions of df_saleprice.\n", + "# create_missing was called separately on df_saleprice_scaled, so if it samples rows at random\n", + "# the NaN positions there can differ, and these differences are then taken at rows that were never imputed.\n", + "\n", + "sp_scaled_diff_1 = []\n", + "sp_scaled_diff_5 = []\n", + "sp_scaled_diff_10 = []\n", + "count = 0\n", + "\n", + "for i in sp_1_idx:\n", + " count +=1\n", + " diff1 = 
abs(imputed_saleprice_scaled_df['sp_copy_1_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", + " sp_scaled_diff_1.append(diff1)\n", + " \n", + "\n", + "for i in sp_5_idx:\n", + " diff5 = abs(imputed_saleprice_scaled_df['sp_copy_5_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", + " sp_scaled_diff_5.append(diff5)\n", + "\n", + "for i in 
sp_10_idx:\n", + " diff10 = abs(imputed_saleprice_scaled_df['sp_copy_10_percent'][i] - imputed_saleprice_scaled_df['SalePrice'][i])\n", + " sp_scaled_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "id": "40192344-79a4-444c-a12a-2201dc5aa0c1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_scaled_diff_1))\n", + "print(len(sp_scaled_diff_5))\n", + "print(len(sp_scaled_diff_10))" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "id": "a95bd45c-8a2f-4159-8306-399ec18a4c0f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[0.0, 0.0, 0.0, 0.0, 0.0]" + ] + }, + "execution_count": 80, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sp_scaled_diff_1[:5]" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "id": "0f73d420-8842-4062-ae17-158a0a25e169", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[0.0, 100.0, 20.0, 0.0, 780.0]" + ] + }, + "execution_count": 81, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sp_diff_1[:5]" + ] + }, + { + "cell_type": "markdown", + "id": "a40fd400-913b-4011-b0b9-dd3ca0d5827a", + "metadata": {}, + "source": [ + "#### Calculate the mean and var of list of diff. 
KNN - SalePrice" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "id": "80267827-7f73-49ff-b200-27cdb2963756", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 105.0 and variance 1% is 52105.0\n", + "The mean of 5% is 163.0120000000001 and variance 5% is 46018.96385599976\n", + "The mean of 10% is 470.8 and variance 10% is 3667553.3671999993\n" + ] + } + ], + "source": [ + "m1 = sum(sp_diff_1) / len(sp_diff_1)\n", + "\n", + "# population variance of the differences (generator expression)\n", + "var_res1 = sum((xi - m1) ** 2 for xi in sp_diff_1) / len(sp_diff_1)\n", + "\n", + "m5 = sum(sp_diff_5) / len(sp_diff_5)\n", + "\n", + "# population variance of the differences (generator expression)\n", + "var_res5 = sum((xii - m5) ** 2 for xii in sp_diff_5) / len(sp_diff_5)\n", + "\n", + "\n", + "m10 = sum(sp_diff_10) / len(sp_diff_10)\n", + "\n", + "# population variance of the differences (generator expression)\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_diff_10) / len(sp_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "id": "358545ff-2fcf-4c99-9049-4eaf6dd110bd", + "metadata": {}, + "outputs": [], + "source": [ + "df_knn_saleprice = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", + " '5%_saleprice': [m5, var_res5],\n", + " '10%_saleprice': [m10, var_res10]}, orient='index')\n", + "df_knn_saleprice.columns=['diff. list Mean(KNN)', 'diff. list Var.(KNN)']" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "id": "3714c8f9-58db-40a7-b5a2-6bb7e788b734", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " diff. list Mean(KNN) diff. list Var.(KNN)\n", + "1%_saleprice 105.000 5.210500e+04\n", + "5%_saleprice 163.012 4.601896e+04\n", + "10%_saleprice 470.800 3.667553e+06" + ] + }, + "execution_count": 84, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_knn_saleprice" + ] + }, + { + "cell_type": "markdown", + "id": "fd7608a8-c5fb-425c-a340-af01801ee349", + "metadata": {}, + "source": [ + "#### Calculate the mean and var of list of diff. KNN - SalePrice scaled" + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "id": "bb03017f-3d91-48d9-8ebf-7cb5c25fadc3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 0.0 and variance 1% is 0.0\n", + "The mean of 5% is 1.2498264129982007e-05 and variance 5% is 7.654123706876951e-09\n", + "The mean of 10% is 0.000265 and variance 10% is 2.9738417673284677e-06\n" + ] + } + ], + "source": [ + "m1 = sum(sp_scaled_diff_1) / len(sp_scaled_diff_1)\n", + "\n", + "# population variance of the differences (generator expression)\n", + "var_res1 = sum((xi - m1) ** 2 for xi in sp_scaled_diff_1) / len(sp_scaled_diff_1)\n", + "\n", + "m5 = sum(sp_scaled_diff_5) / len(sp_scaled_diff_5)\n", + "\n", + "# population variance of the differences (generator expression)\n", + "var_res5 = sum((xii - m5) ** 2 for xii in sp_scaled_diff_5) / len(sp_scaled_diff_5)\n", + "\n", + "\n", + "m10 = sum(sp_scaled_diff_10) / len(sp_scaled_diff_10)\n", + "\n", + "# population variance of the differences (generator expression)\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_scaled_diff_10) / len(sp_scaled_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "id": "290d8db2-c9f4-4028-ab44-ad68c9e7b3c5", + "metadata": 
{}, + "outputs": [], + "source": [ + "df_knn_saleprice_scaled = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", + " '5%_saleprice': [m5, var_res5],\n", + " '10%_saleprice': [m10, var_res10]}, orient='index')\n", + "df_knn_saleprice_scaled.columns=['diff. list Mean(KNN) scaled', 'diff. list Var.(KNN) scaled']" + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "id": "89347fd7-d87d-42bb-b375-a75417c395de", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " diff. list Mean(KNN) scaled diff. list Var.(KNN) scaled\n", + "1%_saleprice 0.000000 0.000000e+00\n", + "5%_saleprice 0.000012 7.654124e-09\n", + "10%_saleprice 0.000265 2.973842e-06" + ] + }, + "execution_count": 87, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_knn_saleprice_scaled" + ] + }, + { + "cell_type": "markdown", + "id": "c984dc69-f85f-4f1b-8c94-4afb48c1c8db", + "metadata": {}, + "source": [ + "### Perform MEAN imputation" + ] + }, + { + "cell_type": "code", + "execution_count": 88, + "id": "008bc14f-45e7-42d8-b843-2fee7bcf26c2", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice2 = df_saleprice.copy(deep=True)\n", + "df_saleprice_scaled2 = df_saleprice_scaled.copy(deep=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "id": "bd71dc1a-f137-46ed-bf2b-f3d87fd4b6a0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 1.0\n", + "sp_copy_5_percent sp_copy_5_percent 5.0\n", + "sp_copy_10_percent sp_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice2))" + ] + }, + { + "cell_type": "code", + "execution_count": 90, + "id": "46237cfd-6361-466f-b66f-32f5940149d6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 1.0\n", + "sp_copy_5_percent sp_copy_5_percent 5.0\n", + "sp_copy_10_percent sp_copy_10_percent 10.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice_scaled2))" + ] + }, + { + "cell_type": "markdown", + "id": "64465299-5620-47b9-a28d-afb5494f279e", + "metadata": {}, + "source": [ + "#### Impute Mean values in missing for saleprice and saleprice_scaled" + ] + }, + { + "cell_type": "code", + 
"execution_count": 91, + "id": "28cf6b75-eebf-4758-94ec-4b3536f2c659", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice2['sp_copy_1_percent'] = df_saleprice2['sp_copy_1_percent'].fillna(df_saleprice2['sp_copy_1_percent'].mean())\n", + "df_saleprice2['sp_copy_5_percent'] = df_saleprice2['sp_copy_5_percent'].fillna(df_saleprice2['sp_copy_5_percent'].mean())\n", + "df_saleprice2['sp_copy_10_percent'] = df_saleprice2['sp_copy_10_percent'].fillna(df_saleprice2['sp_copy_10_percent'].mean())" + ] + }, + { + "cell_type": "code", + "execution_count": 92, + "id": "2409dd8c-3cd0-4742-b0ac-14dea1fdb504", + "metadata": {}, + "outputs": [], + "source": [ + "df_saleprice_scaled2['sp_copy_1_percent'] = df_saleprice_scaled2['sp_copy_1_percent'].fillna(df_saleprice_scaled2['sp_copy_1_percent'].mean())\n", + "df_saleprice_scaled2['sp_copy_5_percent'] = df_saleprice_scaled2['sp_copy_5_percent'].fillna(df_saleprice_scaled2['sp_copy_5_percent'].mean())\n", + "df_saleprice_scaled2['sp_copy_10_percent'] = df_saleprice_scaled2['sp_copy_10_percent'].fillna(df_saleprice_scaled2['sp_copy_10_percent'].mean())" + ] + }, + { + "cell_type": "markdown", + "id": "62377754-b682-45e5-8faa-1a4a186bd3c7", + "metadata": {}, + "source": [ + "#### After MEAN imputation - Saleprice and saleprice scaled" + ] + }, + { + "cell_type": "code", + "execution_count": 93, + "id": "6c448556-55f4-4685-aed2-6b67d5ad8a2a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice2))" + ] + }, + { + "cell_type": "code", + "execution_count": 94, + "id": "d9775fbf-7a72-4352-b446-488e9d25b6a2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " 
column_name percent_missing\n", + "SalePrice SalePrice 0.0\n", + "sp_copy_1_percent sp_copy_1_percent 0.0\n", + "sp_copy_5_percent sp_copy_5_percent 0.0\n", + "sp_copy_10_percent sp_copy_10_percent 0.0\n" + ] + } + ], + "source": [ + "print(get_percent_missing(df_saleprice_scaled2))" + ] + }, + { + "cell_type": "code", + "execution_count": 95, + "id": "136f87e6-a4af-4229-b36a-695f712deee5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "436 116000 116000.0 116000.0 116000.000000\n", + "21 139400 139400.0 139400.0 139400.000000\n", + "618 314813 314813.0 314813.0 314813.000000\n", + "207 141000 141000.0 141000.0 182369.783333\n", + "366 159000 159000.0 159000.0 159000.000000" + ] + }, + "execution_count": 95, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_saleprice2.sample(5)" + ] + }, + { + "cell_type": "code", + "execution_count": 96, + "id": "784cb61c-78f8-4b31-b709-379c50024dca", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " SalePrice sp_copy_1_percent sp_copy_5_percent sp_copy_10_percent\n", + "457 0.307041 0.307041 0.307041 0.201890\n", + "876 0.135190 0.135190 0.135190 0.135190\n", + "361 0.152895 0.152895 0.152895 0.152895\n", + "682 0.191779 0.191779 0.191779 0.201890\n", + "523 0.208096 0.208096 0.208096 0.208096" + ] + }, + "execution_count": 96, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_saleprice_scaled2.sample(5)" + ] + }, + { + "cell_type": "markdown", + "id": "33c1f3b7-5afc-45cb-8b43-9682ec87156d", + "metadata": {}, + "source": [ + "#### Create List of differences for saleprice and saleprice_scaled Dataframes" + ] + }, + { + "cell_type": "code", + "execution_count": 97, + "id": "d2faf410-f83e-4ccb-89d4-e6f8c7adffbb", + "metadata": {}, + "outputs": [], + "source": [ + "# create lists of absolute differences between imputed and original values (unscaled)\n", + "\n", + "sp_mean_diff_1 = []\n", + "sp_mean_diff_5 = []\n", + "sp_mean_diff_10 = []\n", + "\n", + "for i in sp_1_idx:\n", + " diff1 = abs(df_saleprice2['sp_copy_1_percent'][i] - df_saleprice2['SalePrice'][i])\n", + " sp_mean_diff_1.append(diff1)\n", + "\n", + "for i in sp_5_idx:\n", + " diff5 = abs(df_saleprice2['sp_copy_5_percent'][i] - df_saleprice2['SalePrice'][i])\n", + " sp_mean_diff_5.append(diff5)\n", + "\n", + "for i in sp_10_idx:\n", + " diff10 = abs(df_saleprice2['sp_copy_10_percent'][i] - df_saleprice2['SalePrice'][i])\n", + " sp_mean_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 98, + "id": "789b07c5-530a-4111-8c97-f5297f7da5e4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_mean_diff_1))\n", + "print(len(sp_mean_diff_5))\n", + "print(len(sp_mean_diff_10))" + ] + }, + { + "cell_type": "code", + "execution_count": 99, + "id": "4fec222c-2420-41af-9e2a-d9773e1d6259", + 
"metadata": {}, + "outputs": [], + "source": [ + "# create lists of absolute differences between imputed and original values (scaled)\n", + "\n", + "sp_scaled_mean_diff_1 = []\n", + "sp_scaled_mean_diff_5 = []\n", + "sp_scaled_mean_diff_10 = []\n", + "\n", + "for i in sp_1_idx:\n", + " diff1 = abs(df_saleprice_scaled2['sp_copy_1_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", + " sp_scaled_mean_diff_1.append(diff1)\n", + "\n", + "for i in sp_5_idx:\n", + " diff5 = abs(df_saleprice_scaled2['sp_copy_5_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", + " sp_scaled_mean_diff_5.append(diff5)\n", + "\n", + "for i in sp_10_idx:\n", + " diff10 = abs(df_saleprice_scaled2['sp_copy_10_percent'][i] - df_saleprice_scaled2['SalePrice'][i])\n", + " sp_scaled_mean_diff_10.append(diff10)" + ] + }, + { + "cell_type": "code", + "execution_count": 100, + "id": "de9bf1de-68fe-4894-915a-7069b386123f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "10\n", + "50\n", + "100\n" + ] + } + ], + "source": [ + "print(len(sp_scaled_mean_diff_1))\n", + "print(len(sp_scaled_mean_diff_5))\n", + "print(len(sp_scaled_mean_diff_10))" + ] + }, + { + "cell_type": "markdown", + "id": "f7b93757-d1a7-41a1-85fa-3ee77734be5b", + "metadata": {}, + "source": [ + "#### Calculate mean and var of list of diff. 
- MEAN impute SalePrice" + ] + }, + { + "cell_type": "code", + "execution_count": 101, + "id": "c60d3aad-33f0-48f4-8bb0-f8af45e33e1e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 47198.61696969698 and variance 1% is 634546571.3543438\n", + "The mean of 5% is 54438.20686315788 and variance 5% is 1768876209.3358026\n", + "The mean of 10% is 58045.636667 and variance 10% is 2875290913.3009353\n" + ] + } + ], + "source": [ + "m1 = sum(sp_mean_diff_1) / len(sp_mean_diff_1)\n", + "\n", + "# population variance of the differences (generator expression)\n", + "var_res1 = sum((xi - m1) ** 2 for xi in sp_mean_diff_1) / len(sp_mean_diff_1)\n", + "\n", + "m5 = sum(sp_mean_diff_5) / len(sp_mean_diff_5)\n", + "\n", + "# population variance of the differences (generator expression)\n", + "var_res5 = sum((xii - m5) ** 2 for xii in sp_mean_diff_5) / len(sp_mean_diff_5)\n", + "\n", + "\n", + "m10 = sum(sp_mean_diff_10) / len(sp_mean_diff_10)\n", + "\n", + "# population variance of the differences (generator expression)\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_mean_diff_10) / len(sp_mean_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 102, + "id": "e7f6e5cf-4eaa-4bfe-add2-fc7f600941b7", + "metadata": {}, + "outputs": [], + "source": [ + "df_mean_saleprice = pd.DataFrame.from_dict({'1%_saleprice': [m1, var_res1],\n", + " '5%_saleprice': [m5, var_res5],\n", + " '10%_saleprice': [m10, var_res10]}, orient='index')\n", + "df_mean_saleprice.columns=['diff. list Mean(MI)', 'diff. list Var.(MI)']" + ] + }, + { + "cell_type": "code", + "execution_count": 103, + "id": "cc37eeaf-e3cd-4a83-870d-fab7037eeffe", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " diff. list Mean(MI) diff. list Var.(MI)\n", + "1%_saleprice 47198.616970 6.345466e+08\n", + "5%_saleprice 54438.206863 1.768876e+09\n", + "10%_saleprice 58045.636667 2.875291e+09" + ] + }, + "execution_count": 103, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_mean_saleprice" + ] + }, + { + "cell_type": "markdown", + "id": "f405f073-1b45-47e8-873b-7a9d34ad0e5c", + "metadata": {}, + "source": [ + "#### Calculate mean and var of list of diff. - MEAN impute SalePrice scaled" + ] + }, + { + "cell_type": "code", + "execution_count": 104, + "id": "2516b4f7-6b79-4636-9bd5-0738343ea355", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The mean of 1% is 0.0 and variance 1% is 0.0\n", + "The mean of 5% is 0.0016175777048509216 and variance 5% is 5.557201947380946e-05\n", + "The mean of 10% is 0.018922 and variance 10% is 0.004250732648521598\n" + ] + } + ], + "source": [ + "m1 = sum(sp_scaled_mean_diff_1) / len(sp_scaled_mean_diff_1)\n", + "\n", + "# population variance of the differences (generator expression)\n", + "var_res1 = sum((xi - m1) ** 2 for xi in sp_scaled_mean_diff_1) / len(sp_scaled_mean_diff_1)\n", + "\n", + "m5 = sum(sp_scaled_mean_diff_5) / len(sp_scaled_mean_diff_5)\n", + "\n", + "# population variance of the differences (generator expression)\n", + "var_res5 = sum((xii - m5) ** 2 for xii in sp_scaled_mean_diff_5) / len(sp_scaled_mean_diff_5)\n", + "\n", + "\n", + "m10 = sum(sp_scaled_mean_diff_10) / len(sp_scaled_mean_diff_10)\n", + "\n", + "# population variance of the differences (generator expression)\n", + "var_res10 = sum((xiii - m10) ** 2 for xiii in sp_scaled_mean_diff_10) / len(sp_scaled_mean_diff_10)\n", + "\n", + "print(f\"The mean of 1% is {m1} and variance 1% is {var_res1}\")\n", + "print(f\"The mean of 5% is {m5} and variance 5% is {var_res5}\")\n", + "print(f\"The mean of 10% is {m10} and variance 10% is {var_res10}\")" + ] + }, + { + "cell_type": "code", + 
"execution_count": 105, + "id": "fe6a93b8-d6cb-4d7d-856b-ab4ee8fe78fc", + "metadata": {}, + "outputs": [], + "source": [ + "df_mean_saleprice_scaled = pd.DataFrame.from_dict({'1%_saleprice_scaled': [m1, var_res1],\n", + " '5%_saleprice_scaled': [m5, var_res5],\n", + " '10%_saleprice_scaled': [m10, var_res10]}, orient='index')\n", + "df_mean_saleprice_scaled.columns=['diff. list Mean(MI) scaled', 'diff. list Var.(MI) scaled']" + ] + }, + { + "cell_type": "code", + "execution_count": 106, + "id": "e74c35ed-7c2d-44ab-b6c2-4d81c2c6b6bb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " diff. list Mean(MI) scaled diff. list Var.(MI) scaled\n", + "1%_saleprice_scaled 0.000000 0.000000\n", + "5%_saleprice_scaled 0.001618 0.000056\n", + "10%_saleprice_scaled 0.018922 0.004251" + ] + }, + "execution_count": 106, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_mean_saleprice_scaled" + ] + }, + { + "cell_type": "markdown", + "id": "876b979a-f5c4-43a7-9ead-d5d866bef078", + "metadata": {}, + "source": [ + "# 2.2 Housing Data Results - KNN and MEAN" + ] + }, + { + "cell_type": "code", + "execution_count": 107, + "id": "e90e9486-280d-4e96-b16a-0c3314eaedc9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<table>\n", + "<tr><th></th><th>diff. list Mean(KNN)</th><th>diff. list Var.(KNN)</th></tr>\n", + "<tr><th>1%_saleprice</th><td>105.000</td><td>5.210500e+04</td></tr>\n", + "<tr><th>5%_saleprice</th><td>163.012</td><td>4.601896e+04</td></tr>\n", + "<tr><th>10%_saleprice</th><td>470.800</td><td>3.667553e+06</td></tr>\n", + "</table>\n", + "<table>\n", + "<tr><th></th><th>diff. list Mean(KNN) scaled</th><th>diff. list Var.(KNN) scaled</th></tr>\n", + "<tr><th>1%_saleprice</th><td>0.000000</td><td>0.000000e+00</td></tr>\n", + "<tr><th>5%_saleprice</th><td>0.000012</td><td>7.654124e-09</td></tr>\n", + "<tr><th>10%_saleprice</th><td>0.000265</td><td>2.973842e-06</td></tr>\n", + "</table>\n", + "<table>\n", + "<tr><th></th><th>diff. list Mean(MI)</th><th>diff. list Var.(MI)</th></tr>\n", + "<tr><th>1%_saleprice</th><td>47198.616970</td><td>6.345466e+08</td></tr>\n", + "<tr><th>5%_saleprice</th><td>54438.206863</td><td>1.768876e+09</td></tr>\n", + "<tr><th>10%_saleprice</th><td>58045.636667</td><td>2.875291e+09</td></tr>\n", + "</table>\n", + "<table>\n", + "<tr><th></th><th>diff. list Mean(MI) scaled</th><th>diff. list Var.(MI) scaled</th></tr>\n", + "<tr><th>1%_saleprice_scaled</th><td>0.000000</td><td>0.000000</td></tr>\n", + "<tr><th>5%_saleprice_scaled</th><td>0.001618</td><td>0.000056</td></tr>\n", + "<tr><th>10%_saleprice_scaled</th><td>0.018922</td><td>0.004251</td></tr>\n", + "</table>
\n", + "
" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 107, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "multi_table([df_knn_saleprice, df_knn_saleprice_scaled, df_mean_saleprice, df_mean_saleprice_scaled])" + ] + }, + { + "cell_type": "markdown", + "id": "e07a4e01-7e4e-4bdb-b6c7-ef2424fc6a80", + "metadata": {}, + "source": [ + "Result: Another takeaway here is that scaling the data before imputation yields smaller differences between the original and imputed values. Mean imputation was less accurate than KNN imputation, but its imputed values were still closer to the originals on the scaled data than on the unscaled data. KNN imputation outperformed mean imputation in both cases, and its results were best on the scaled dataset." + ] + }, + { + "cell_type": "markdown", + "id": "977be574-18b2-4f80-a019-2a86227a14d6", + "metadata": {}, + "source": [ + "# Conclusion\n", + "1. KNN imputation performs better than mean imputation\n", + "2. 
Using the scaled dataset instead of the unscaled one gives even better results (close to perfect!)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "764c9bdb-78dc-4287-a527-0e14ff58a5e9", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/Imputation_best_practices/readme.md b/notebooks/Imputation_best_practices/readme.md new file mode 100644 index 0000000..b4d57c1 --- /dev/null +++ b/notebooks/Imputation_best_practices/readme.md @@ -0,0 +1 @@ +This folder will contain the notebook and the data used to demonstrate how to use KNN and mean imputation effectively
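
The evaluation step that the notebook repeats for each masking level (the mean and variance of the absolute differences between original and imputed values) can be sketched as one small helper. This is a minimal sketch, not the notebook's own code: `summarize_abs_errors` and the toy data are hypothetical names and values introduced only for illustration. It uses the population variance (dividing by the number of differences), which is what the notebook's cells compute.

```python
from statistics import mean, pvariance


def summarize_abs_errors(original, imputed, masked_idx):
    """Return (mean, population variance) of |imputed - original| over masked_idx.

    Hypothetical helper: it condenses the per-level loops plus the
    mean/variance arithmetic into one call.
    """
    diffs = [abs(imputed[i] - original[i]) for i in masked_idx]
    return mean(diffs), pvariance(diffs)


# Hypothetical toy data: positions 1 and 3 are treated as missing.
original = [10.0, 20.0, 30.0, 40.0]
masked_idx = [1, 3]

# Mean imputation: fill the masked positions with the mean of the observed values.
observed = [v for i, v in enumerate(original) if i not in masked_idx]
fill_value = mean(observed)
imputed = [fill_value if i in masked_idx else v for i, v in enumerate(original)]

mean_err, var_err = summarize_abs_errors(original, imputed, masked_idx)
print(f"mean abs. difference = {mean_err}, variance = {var_err}")
```

Calling the helper once per masking level and per imputation method avoids copying the loop and the print statements, which is where a wrong variable can slip into a report line.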