Completed by Mangaliso Makhoba.
Overview: This project uses the Titanic dataset to build a simple statistical model that returns the conditional survival probability of a passenger, given a condition on a numerical variable from the dataset.
Problem Statement: Build a model that returns a passenger's survival chance given that passenger's details as input.
Data: Titanic Kaggle Challenge
Deliverables: Probability
- Statistical Modeling
- Imputation of Missing values
- Probability
- Scikit-learn
- Jupyter Notebook
Ensure that the following packages have been installed and imported.
pip install numpy
pip install pandas
Follow the instructions at https://docs.anaconda.com/anaconda/install/ to install Anaconda with Jupyter.
Alternatively: VS Code can render Jupyter Notebooks
The structure of this notebook is as follows:
- First, we'll load our data to get a view of the predictor and response variables we will be modeling.
- We determine the number of missing values for a specific column
- We'll then preprocess our data by imputing missing values: the mean for numerical features and the mode for categorical features.
- We then model the survival probability of a passenger given their age, class, gender and so on.
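The imputation step described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the helper name `impute_missing` and the toy data are hypothetical, and picking the first mode when several are tied is an assumption.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with missing values filled per column.

    Numerical columns are filled with their mean, all other columns
    with their mode (the first one, if several values are tied).
    """
    df_clean = df.copy()
    for name in df_clean.columns:
        column = df_clean[name]
        if not column.isna().any():
            continue  # nothing to impute in this column
        if is_numeric_dtype(column):
            fill_value = column.mean()  # mean() skips missing values
        else:
            modes = column.mode()  # mode() ignores missing values
            if modes.empty:
                continue  # column is entirely missing; leave it as-is
            fill_value = modes.iloc[0]
        df_clean[name] = column.fillna(fill_value)
    return df_clean


# Tiny made-up frame for illustration only (not the Titanic data).
toy = pd.DataFrame({"Age": [20.0, None, 40.0], "Embarked": ["S", None, "S"]})
toy_clean = impute_missing(toy)
print(toy_clean)
```

In the toy frame, the missing age becomes 30.0 (the mean of 20 and 40) and the missing port becomes "S".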
A function that determines the number of missing entries for a specified column in the dataset. The function should return an int that corresponds to the number of missing entries in the specified column.
Function Specifications:
- Should take a pandas DataFrame and a column_name as input and return an int as output.
- The int should be the number of missing entries in the column.
- Should be generalised to work on ANY dataframe.
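One way to meet this specification is sketched below. It is a minimal version, not necessarily the notebook's own implementation, and the toy DataFrame is made up for illustration.

```python
import pandas as pd


def total_missing(df: pd.DataFrame, column_name: str) -> int:
    """Return the number of missing entries in the given column."""
    # isna() flags each missing cell; summing the booleans counts them.
    return int(df[column_name].isna().sum())


# Tiny made-up frame for illustration only (not the Titanic data).
toy = pd.DataFrame({"Age": [22.0, None, 35.0, None], "Survived": [0, 1, 1, 0]})
print(total_missing(toy, "Age"))       # 2
print(total_missing(toy, "Survived"))  # 0
```

Casting with `int(...)` converts the NumPy integer returned by `sum()` into a plain Python `int`, which is what the specification asks for.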
Expected Outputs:
total_missing(df,'Age') == 177
total_missing(df,'Survived') == 0
Write a function that takes in as input a dataframe and a column name, and returns the mean for numerical columns and the mode for non-numerical columns.
Function Specifications:
- The function should take two inputs: (df, column_name), where df is a pandas DataFrame and column_name is a str.
- If the column_name does not exist in df, raise a ValueError.
- Should return as output the mean if the specified column is numerical and return a list of the mode(s) otherwise.
- The mean should be rounded to 2 decimal places.
- If there is more than one mode for a given non-numerical column, the function should return a list of all modes.
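A possible implementation of this specification is sketched below. It assumes that "numerical" means whatever pandas' `is_numeric_dtype` reports as numeric. The toy DataFrame is made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def calc_mean_mode(df: pd.DataFrame, column_name: str):
    """Return the rounded mean of a numerical column, else a list of modes."""
    if column_name not in df.columns:
        raise ValueError(f"Column {column_name!r} not found in DataFrame")
    column = df[column_name]
    if is_numeric_dtype(column):
        # mean() ignores missing values by default.
        return round(float(column.mean()), 2)
    # mode() returns every tied value, so ties yield a list with several items.
    return column.mode().tolist()


# Tiny made-up frame for illustration only (not the Titanic data).
toy = pd.DataFrame(
    {
        "Age": [22.0, None, 35.0, 40.0, None],
        "Embarked": ["S", "C", "S", "C", None],
    }
)
print(calc_mean_mode(toy, "Age"))       # 32.33
print(calc_mean_mode(toy, "Embarked"))  # ['C', 'S']
```

Returning a list even when there is a single mode keeps the return type consistent for non-numerical columns, which matches the expected output `['S']` for `Embarked`.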
Expected Outputs:
calc_mean_mode(df, 'Age') == 29.7
calc_mean_mode(df, 'Embarked') == ['S']
We ultimately want to predict the survival chances of the passengers in the testing set. We can start by building a simple model from the data we already have, using conditional probability! Write a function that returns the survival probability of a passenger, given a condition on a numerical variable from the dataset. The condition will consist of a column_name, a value and a boolean_operator. Possible boolean operators include "<", ">", or "==". For example, column_name = "Age", boolean_operator = ">", and value = 40 together form the condition Age > 40.
Function specifications:
- The function should make use of the df_clean DataFrame loaded earlier in this notebook.
- It should take a numerical column_name string, a boolean_operator string, and a value of type string as input.
- It should return a survival likelihood as a number between 0 and 1, rounded to 2 decimal places.
- Assume that column_name exists in df_clean.
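The conditional probability here is the share of survivors among the passengers who satisfy the condition: the number of rows that meet the condition and have Survived equal to 1, divided by the number of rows that meet the condition. The sketch below is one way to compute it, not necessarily the notebook's own code. It assumes the response column is named `Survived` and holds 0/1 values, converts the string value with `float`, and returns NaN when no passenger matches, which is a choice the specification does not cover. The toy DataFrame is made up for illustration.

```python
import operator

import pandas as pd

# Map each supported operator string to its comparison function.
_OPERATORS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def survival_likelihood(
    df_clean: pd.DataFrame, column_name: str, boolean_operator: str, value: str
) -> float:
    """Return P(Survived = 1 | column_name <operator> value), rounded to 2 dp."""
    if boolean_operator not in _OPERATORS:
        raise ValueError(f"Unsupported operator: {boolean_operator!r}")
    compare = _OPERATORS[boolean_operator]
    # The value arrives as a string, so convert it before comparing.
    mask = compare(df_clean[column_name], float(value))
    outcomes = df_clean.loc[mask, "Survived"]
    if outcomes.empty:
        return float("nan")  # assumption: no matching passengers gives NaN
    # The mean of a 0/1 column is the fraction of survivors.
    return round(float(outcomes.mean()), 2)


# Tiny made-up frame for illustration only (not the Titanic data).
toy = pd.DataFrame(
    {
        "Age": [30, 10, 40, 12, 50],
        "Pclass": [1, 3, 3, 3, 2],
        "Survived": [1, 0, 1, 1, 0],
    }
)
print(survival_likelihood(toy, "Pclass", "==", "3"))  # 0.67
print(survival_likelihood(toy, "Age", "<", "15"))     # 0.5
```

Using a dictionary of functions from the `operator` module avoids calling `eval` on the operator string.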
Expected Outputs:
survival_likelihood(df_clean,"Pclass","==","3") == 0.24
survival_likelihood(df_clean,"Age","<","15") == 0.58
Finding an appropriate strategy to impute missing values is very important for increasing the accuracy of the model you are building.
Authors: Mangaliso Makhoba, Explore Data Science Academy
Contact: [email protected]
This project is complete.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.