
Random Forest Model

The random forest model uses bootstrap samples and an ensemble of decision trees for learning.

Questions to be Answered

  1. What features contribute most to being classified as above or below average of the year 2020?
    (average of total cases and of total deaths)

  2. What features contribute most to being classified as above or below average of years 2020 and 2021?
    (average of total cases and of total deaths)

  3. What are the differences?

  4. If the number of features is reduced, what do the differences look like?

Reasons for choosing the model

Benefits

  1. Provides importances of features

  2. Variance is reduced by the randomness of bootstrap sampling with replacement, which improves the model's fit

  3. Datasets are sufficiently large for this approach

  4. Robust to outliers and nonlinear data

  5. Computation time is reasonable, since the trees can be trained independently

Drawbacks

  1. It may not always be possible to ascertain why a random forest makes its decisions. Although sklearn provides feature importances, their limitations should be considered:

"The impurity-based feature importances computed on tree-based models suffer from two flaws that can lead to misleading conclusions. First they are computed on statistics derived from the training dataset and therefore do not necessarily inform us on which features are most important to make good predictions on held-out dataset. Secondly, they favor high cardinality features, that is features with many unique values." (sklearn)

  2. Sometimes a slight increase in bias may occur
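One mitigation sklearn suggests for the quoted flaws is permutation importance, which is computed on held-out data rather than on training statistics. A minimal sketch on synthetic data (the dataset here is an illustrative stand-in, not the project's COVID data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's data: 4 features, binary label.
X, y = make_classification(n_samples=400, n_features=4, n_informative=2,
                           random_state=78)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

model = RandomForestClassifier(n_estimators=128, random_state=78)
model.fit(X_train, y_train)

# Impurity-based importances: derived from the training data only.
print(model.feature_importances_)

# Permutation importances: measured by shuffling each feature on the
# held-out test set, so they reflect what the model needs to predict
# well on unseen samples.
result = permutation_importance(model, X_test, y_test, n_repeats=5,
                                random_state=78)
print(result.importances_mean)
```

Comparing the two rankings is a quick check on whether the impurity-based importances are being skewed by high-cardinality features.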

Setting up the Analysis

Database: United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv

  • Explore new cases and new deaths from COVID-19 by state over the years 2020 and 2021

  • Use the yearly statistics from this database in the analysis of other features in the vax_cases_death.csv database also

  • Database features are the new cases and the new deaths from COVID-19 and the states

  • Label columns are based on the means for the year 2020 and the means for the years 2020 and 2021 combined

Database vax_cases_death.csv

  • Explore the distributions and administrations of vaccines over the year 2021 by state

  • Possibly explore additional features.

  • Database features are the distributions, the administrations, and the states

  • Label columns are based on the means for the year 2020 and the means for the years 2020 and 2021 combined (from database United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv)

Imbalance: Note that the set of class labels is NOT imbalanced. This was verified by performing a value count on the label columns.
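The balance check above can be sketched with pandas (the column name and values here are hypothetical; in the project the labels come from the cleaned database):

```python
import pandas as pd

# Hypothetical label column; the real one is built from the cleaned data.
df = pd.DataFrame({"C1": [0, 1, 1, 0, 1, 0, 0, 1]})

# value_counts shows how many rows fall in each class of the label column.
counts = df["C1"].value_counts()
print(counts)

# A roughly even split (here 4 vs. 4) indicates the classes are balanced.
assert abs(counts[0] - counts[1]) <= 1
```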

Finding the optimal model

According to sklearn, the scikit-learn implementation differs from the original conceptualization of random forests, in which each sample-sized decision tree votes for the best outcome; in scikit-learn the ensemble learners' outcomes are averaged instead. Individual decision trees typically have high variance, which causes overfitting. Both the voting and the averaging schemes tend to reduce this variance, so the random forest's trees are less likely to overfit the data. By assessing the model results, parameters can be found which provide a predictive fit rather than an overfit. The model parameters which are explored to find an optimal model are as follows:

n_estimators=128
random_state=78
criterion = 'gini' or 'entropy'
max_depth = None or 10
max_features = 'auto' or 'sqrt'
min_impurity_decrease = 0.0 or a fraction
oob_score = False or True


Table: Parameter Evaluations to Find an Optimal Model

| notebook | n_estimators | random_state | criterion | max_depth | max_features | min_impurity_decrease | oob_score |
|----------|--------------|--------------|-----------|-----------|--------------|-----------------------|-----------|
| 1        | 128          | 78           | gini      | None      | 'auto'       | 0.0                   | False     |
| 2        | 128          | 78           | gini      | None      | 'auto'       | 0.0                   | True      |
| 3        | 128          | 78           | entropy   | None      | 'auto'       | 0.0                   | False     |
| 4        | 128          | 78           | entropy   | None      | 'auto'       | 0.0                   | True      |
| 5        | 128          | 78           | gini      | 10        | 'sqrt'       | 0.0                   | False     |
| 6        | 128          | 78           | gini      | 10        | 'sqrt'       | 0.0                   | True      |
| 7        | 128          | 78           | entropy   | 10        | 'sqrt'       | 0.0                   | False     |
| 8        | 128          | 78           | entropy   | 10        | 'sqrt'       | 0.0                   | True      |
| 9        | 128          | 78           | gini      | None      | 'sqrt'       | 0.02                  | False     |
| 10       | 128          | 78           | gini      | None      | 'sqrt'       | 0.02                  | True      |
| 11       | 128          | 78           | entropy   | None      | 'sqrt'       | 0.02                  | False     |
| 12       | 128          | 78           | entropy   | None      | 'sqrt'       | 0.02                  | True      |
| 13       | 128          | 78           | entropy   | 10        | 'sqrt'       | 0.0                   | True      |
| 14       | 128          | 78           | gini      | None      | 'sqrt'       | 0.5                   | False     |
| 15       | 128          | 78           | gini      | None      | 'sqrt'       | 0.5                   | True      |
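Each notebook in the sweep corresponds to one combination of keyword arguments to scikit-learn's `RandomForestClassifier`. As a sketch, the settings from notebook 6 (the configuration later selected as optimal) look like this:

```python
from sklearn.ensemble import RandomForestClassifier

# Settings from notebook 6 above. max_features='sqrt' considers
# sqrt(n_features) candidate features at each split; oob_score=True
# scores the model on the out-of-bag samples of each bootstrap.
model = RandomForestClassifier(
    n_estimators=128,
    random_state=78,
    criterion="gini",
    max_depth=10,
    max_features="sqrt",
    min_impurity_decrease=0.0,
    oob_score=True,
)
print(model.get_params()["n_estimators"])  # → 128
```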



Model Evaluation

Cases and Deaths

The approach is to create label columns with binary outcomes from the total cases and the total deaths, and then to use a random forest classifier to analyze the features. Here the binary outcome is whether the features result in an above-average or a below-average number of cases or deaths, determined by:

  • First, evaluating the mean for 2020 from the United_States_COVID-19_Cases_and_Deaths_by_State_over_Time database.

  • Second, evaluating the mean for 2020 and 2021 combined from the same database.

These two means are used for feature analyses with both the United_States_COVID-19_Cases_and_Deaths_by_State_over_Time database (Database 1) and the vax_cases_death.csv database (Database 2). The following table illustrates the four label columns created:

Table: Label Column Outcomes

|        | Database 1 | Database 2 |
|--------|------------|------------|
| Cases  | C1         | C2         |
| Deaths | D1         | D2         |

Database 1: United_States_COVID-19_Cases_and_Deaths_by_State_over_Time database
Database 2: vax_cases_death database
C1: >= or < number of cases mean of 2020
C2: >= or < number of cases mean of 2020 and 2021
D1: >= or < number of deaths mean of 2020
D2: >= or < number of deaths mean of 2020 and 2021
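A label column such as C1 can be sketched in pandas as a threshold on the mean (the per-state totals here are hypothetical; the real values come from Database 1):

```python
import pandas as pd

# Hypothetical per-state case totals.
cases = pd.Series([1000, 5000, 200, 8000, 3000])

# C1-style label: 1 if at or above the 2020 mean, else 0.
mean_2020 = cases.mean()  # 3440.0 for these values
label_C1 = (cases >= mean_2020).astype(int)
print(label_C1.tolist())  # → [0, 1, 0, 1, 0]
```

The C2, D1, and D2 columns follow the same pattern with the other means and the death totals.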

The features used in the evaluation from Database 1 and Database 2 were described in the section on setting up the analysis. The data preprocessing was to clean the data and create the label columns.

Training and Testing

Splitting the data into training and testing sets for random forest modeling used the default values in scikit-learn. The optimized model (notebook 6 above) was found using Database 1 with this training and testing approach. It was then used to learn on Database 2.
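With scikit-learn's defaults, `train_test_split` holds out 25% of the rows for testing. A sketch on synthetic data (an illustrative stand-in for Database 1's features and one binary label column):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 3 features, binary label.
X, y = make_classification(n_samples=100, n_features=3, n_informative=2,
                           n_redundant=1, random_state=78)

# Default split: test_size=0.25 when neither test_size nor
# train_size is given.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)
print(len(X_train), len(X_test))  # → 75 25
```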

Model Results

The model results, for both optimal and non-optimal models, are read into a PostgreSQL database to support analysis and presentation. The model results database has tables holding the model input, statistics for the number of cases and the number of deaths, random forest feature importances, and other model results used in machine learning.

Some results for the optimized model are given in the following table:

| database   | CM_A0P0_cases | CM_A0P1_cases | CM_A1P0_cases | CM_A1P1_cases | CM_A0P0_death | CM_A0P1_death | CM_A1P0_death | CM_A1P1_death | acc_score_cases | acc_score_death |
|------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|-----------------|-----------------|
| Database 1 | 1448          | 663           | 360           | 2096          | 1448          | 663           | 360           | 2096          | 0.776001752     | 0.776001752     |
| Database 1 | 2770          | 246           | 298           | 1253          | 2496          | 295           | 587           | 1189          | 0.880884607     | 0.806875411     |
| Database 2 | 71            | 11            | 0             | 516           | 170           | 10            | 3             | 415           | 0.981605351     | 0.97826087      |
| Database 2 | 245           | 4             | 21            | 328           | 271           | 9             | 38            | 280           | 0.95819398      | 0.921404682     |


These are entries from the confusion matrix together with accuracy scores. Here, A stands for Actual and P for Predicted. For example, CM_A0P0_cases means that, for the results on the cases data, the Actual label is 0 and the Predicted label is 0; CM_A0P1_cases means the Actual label is 0 and the Predicted label is 1; and so on. From these results it can be seen that the optimized model performed very well for both cases and deaths on Database 1 and Database 2.
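The CM_* columns can be reproduced with scikit-learn's `confusion_matrix`, whose entry [i][j] counts rows with Actual label i and Predicted label j (the labels below are illustrative):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative actual vs. predicted binary labels.
y_actual    = [0, 0, 1, 1, 1, 0, 1, 0]
y_predicted = [0, 1, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_actual, y_predicted)
# cm[0][0] = A0P0, cm[0][1] = A0P1, cm[1][0] = A1P0, cm[1][1] = A1P1
print(cm)

# Accuracy = correct predictions / total predictions.
print(accuracy_score(y_actual, y_predicted))  # → 0.75
```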

Appropriate Predictive Value

The appropriate predictive value is precision. This is because precision answers the question: "The test for COVID-19 came back positive. How likely is it that the test is correct?" Here precision is defined as:

(True Positive)/(True Positive + False Positive)
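scikit-learn's `precision_score` computes exactly this ratio; a sketch with illustrative labels:

```python
from sklearn.metrics import precision_score

# Illustrative actual vs. predicted binary labels.
y_actual    = [0, 0, 1, 1, 1, 0, 1, 0]
y_predicted = [0, 1, 1, 1, 0, 0, 1, 0]

# TP = 3 (actual 1, predicted 1) and FP = 1 (actual 0, predicted 1),
# so precision = 3 / (3 + 1) = 0.75.
print(precision_score(y_actual, y_predicted))  # → 0.75
```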

Displays of Results

  1. Machine learning model performance optimization graph

  2. Visualizations using importance measures of features

Results files

The results of the random forest model can be found below:

optimal model csv format files:

rfinput_optimal.csv
mlsetstat_optimal.csv
rfimportance_optimal.csv
rfresult_optimal.csv

analysis csv format files:

mlinputs_1.csv (C1 and D1 results)
mlinputs2.csv (C2 and D2 results)
analysis_output.csv (C1, D1, C2, and D2 results)

References:

https://scikit-learn.org/stable/modules/ensemble.html?highlight=min_samples_split#random-forest-parameters
https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_iris.html
Module 18.7.8.1