The random forest model uses bootstrap samples and an ensemble of decision trees for learning.
- What features contribute most to being classified as above or below average of the year 2020 (average of total cases and of total deaths)?
- What features contribute most to being classified as above or below average of years 2020 and 2021 (average of total cases and of total deaths)?
- What are the differences?
- If the number of features is reduced, what do the differences look like?
- Provides feature importances
- Variance is reduced by the randomness of bootstrap sampling (sampling with replacement), which improves the model's fit
- The datasets are sufficiently large for this approach
- Robust to outliers and nonlinear data
- Relatively short computation time is also a benefit
- It may not always be possible to ascertain why a random forest makes a given decision. Although sklearn provides feature importances, their limitations should be considered (see the sketch after this list):
"The impurity-based feature importances computed on tree-based models suffer from two flaws that can lead to misleading conclusions. First they are computed on statistics derived from the training dataset and therefore do not necessarily inform us on which features are most important to make good predictions on held-out dataset. Secondly, they favor high cardinality features, that is features with many unique values." (sklearn)
- A slight increase in bias may sometimes occur
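For exactly these reasons, the sklearn documentation recommends permutation importance computed on held-out data as an alternative. A minimal sketch, assuming a fitted classifier `rf_model` and a DataFrame test split (both placeholder names):

```python
from sklearn.inspection import permutation_importance

# Permutation importance on held-out data avoids both flaws quoted above;
# rf_model, X_test, and y_test are placeholders for the fitted classifier
# and the test split created later in this document.
result = permutation_importance(rf_model, X_test, y_test,
                                n_repeats=10, random_state=78)

# Features sorted from most to least important
for idx in result.importances_mean.argsort()[::-1]:
    print(X_test.columns[idx], result.importances_mean[idx])
```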
Database: United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv
- Explore new cases and new deaths from COVID-19 by state over the years 2020 and 2021
- Also use the yearly statistics from this database in the analysis of features in the vax_cases_death.csv database
- Database features are the new cases, the new deaths, and the states
- Label columns are based on the means for the year 2020 and the means for the years 2020 and 2021 combined
Database: vax_cases_death.csv
- Explore the distributions and administrations of vaccines over the year 2021 by state
- Possibly explore additional features
- Database features are the vaccine distributions, the vaccine administrations, and the states
- Label columns are based on the means for the year 2020 and the means for the years 2020 and 2021 combined (from the United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv database)
Imbalance: Note that the classifier label set is NOT imbalanced. This was verified by doing a value count on the label columns, as in the sketch below.
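A check along the following lines, where `df` is a placeholder for the preprocessed DataFrame with the label columns C1, D1, C2, and D2 (described further below) already attached:

```python
# Sketch: a value count on each label column to confirm the two classes
# are roughly balanced; df is the preprocessed DataFrame with labels.
for col in ["C1", "D1", "C2", "D2"]:
    print(df[col].value_counts(normalize=True))
```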
According to sklearn, the scikit-learn implementation differs from the original conceptualization of random forests, in which each decision tree, built on a bootstrap sample, votes for the best outcome; in scikit-learn the ensemble learners' predictions are instead averaged. Individual decision trees typically have high variance, which causes overfitting. Either the voting or the averaging algorithm tends to reduce the variance, so that the random forest is less likely to overfit the data. By assessing the model results, parameters can be found which provide a predictive fit rather than an overfit. The model parameters explored to find an optimal model are as follows (a construction sketch follows the list):
- n_estimators = 128
- random_state = 78
- criterion = 'gini' or 'entropy'
- max_depth = None or 10
- max_features = 'auto' or 'sqrt'
- min_impurity_decrease = 0.0 or a fraction
- oob_score = False or True
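For reference, a minimal sketch of one explored configuration (notebook 6 in the table below). Note that recent scikit-learn releases have removed max_features='auto' for classifiers; 'sqrt' is the equivalent setting:

```python
from sklearn.ensemble import RandomForestClassifier

# Configuration corresponding to notebook 6 in the table below.
rf_model = RandomForestClassifier(
    n_estimators=128,
    random_state=78,
    criterion="gini",
    max_depth=10,
    max_features="sqrt",
    min_impurity_decrease=0.0,
    oob_score=True,
)
```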
Table: Parameter Evaluations to Find an Optimal Model
notebook | n_estimators | random_state | criterion | max_depth | max_features | min_impurity_decrease | oob_score |
---|---|---|---|---|---|---|---|
1 | 128 | 78 | gini | None | 'auto' | 0.0 | False |
2 | 128 | 78 | gini | None | 'auto' | 0.0 | True |
3 | 128 | 78 | entropy | None | 'auto' | 0.0 | False |
4 | 128 | 78 | entropy | None | 'auto' | 0.0 | True |
5 | 128 | 78 | gini | 10 | 'sqrt' | 0.0 | False |
6 | 128 | 78 | gini | 10 | 'sqrt' | 0.0 | True |
7 | 128 | 78 | entropy | 10 | 'sqrt' | 0.0 | False |
8 | 128 | 78 | entropy | 10 | 'sqrt' | 0.0 | True |
9 | 128 | 78 | gini | None | 'sqrt' | 0.02 | False |
10 | 128 | 78 | gini | None | 'sqrt' | 0.02 | True |
11 | 128 | 78 | entropy | None | 'sqrt' | 0.02 | False |
12 | 128 | 78 | entropy | None | 'sqrt' | 0.02 | True |
13 | 128 | 78 | entropy | 10 | 'sqrt' | 0.0 | True |
14 | 128 | 78 | gini | None | 'sqrt' | 0.5 | False |
15 | 128 | 78 | gini | None | 'sqrt' | 0.5 | True |
The approach is to create label columns with binary outcomes from the total cases and the total deaths and then to use a random forest classifier to analyze the features. Here the binary outcome is whether the number of cases or deaths is above average or below average, determined by:
- First, evaluating the mean over 2020, found from the United_States_COVID-19_Cases_and_Deaths_by_State_over_Time database (Database 1).
- Second, evaluating the mean over 2020 and 2021, found from the same database.
These two means are used for feature analyses with both the United_States_COVID-19_Cases_and_Deaths_by_State_over_Time database and the vax_cases_death.csv database. The following table illustrates the four label columns created:
Table: Label Column Outcomes
 | Database 1 | Database 2 |
---|---|---|
Cases | C1 | C2 |
Deaths | D1 | D2 |
Database 1: United_States_COVID-19_Cases_and_Deaths_by_State_over_Time database
Database 2: vax_cases_death database
C1: at or above (>=) or below (<) the mean number of cases for 2020
C2: at or above (>=) or below (<) the mean number of cases for 2020 and 2021
D1: at or above (>=) or below (<) the mean number of deaths for 2020
D2: at or above (>=) or below (<) the mean number of deaths for 2020 and 2021
The features used in the evaluation from Database 1 and Database 2 were described in the section on setting up the analysis. The data preprocessing consisted of cleaning the data and creating the label columns, sketched below.
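A minimal sketch of the label creation, assuming CDC-style column names (submission_date, tot_cases) for the cleaned Database 1; D1 and D2 follow the same pattern using the total-deaths column:

```python
import pandas as pd

# Sketch of the label-creation step; column names are assumptions about
# the cleaned Database 1.
df = pd.read_csv(
    "United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv",
    parse_dates=["submission_date"],
)

year = df["submission_date"].dt.year
mean_2020 = df.loc[year == 2020, "tot_cases"].mean()
mean_2020_21 = df.loc[year.isin([2020, 2021]), "tot_cases"].mean()

# Label is 1 when at or above the mean, 0 when below
df["C1"] = (df["tot_cases"] >= mean_2020).astype(int)
df["C2"] = (df["tot_cases"] >= mean_2020_21).astype(int)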
Splitting the data into training and testing sets for random forest modeling used the default values in scikit-learn, as in the sketch below. The optimized model (notebook 6 above) was found by training and testing on Database 1 in this manner; the same configuration was then used to learn on Database 2.
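With the scikit-learn defaults this amounts to the following, where X and y are placeholders for the feature matrix and one of the label columns:

```python
from sklearn.model_selection import train_test_split

# Default split is 75% train / 25% test; random_state pins the shuffle.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

rf_model.fit(X_train, y_train)   # rf_model as constructed earlier
y_pred = rf_model.predict(X_test)
```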
The model results, for both the optimal and the non-optimal models, are loaded into a PostgreSQL database to help with analysis and presentation (a loading sketch follows). The model results database has tables holding the model input, statistics for the numbers of cases and deaths, the random forest feature importances, and other model results used in the machine learning.
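One way to load the results, assuming SQLAlchemy and a local PostgreSQL instance; the connection string, table name, and `importances_df` are illustrative:

```python
from sqlalchemy import create_engine

# Illustrative connection string; one table per kind of result (model
# input, case/death statistics, feature importances, other results).
engine = create_engine("postgresql://user:password@localhost:5432/model_results")

# importances_df is a placeholder DataFrame of feature importances
importances_df.to_sql("rfimportance", engine, if_exists="replace", index=False)
```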
Some results for the optimized model are given in the following table:
 | CM_A0P0_cases | CM_A0P1_cases | CM_A1P0_cases | CM_A1P1_cases | CM_A0P0_death | CM_A0P1_death | CM_A1P0_death | CM_A1P1_death | acc_score_cases | acc_score_death |
---|---|---|---|---|---|---|---|---|---|---|
Database 1 | 1448 | 663 | 360 | 2096 | 1448 | 663 | 360 | 2096 | 0.776001752 | 0.776001752 |
Database 1 | 2770 | 246 | 298 | 1253 | 2496 | 295 | 587 | 1189 | 0.880884607 | 0.806875411 |
Database 2 | 71 | 11 | 0 | 516 | 170 | 10 | 3 | 415 | 0.981605351 | 0.97826087 |
Database 2 | 245 | 4 | 21 | 328 | 271 | 9 | 38 | 280 | 0.95819398 | 0.921404682 |
These are the entries of the confusion matrix together with the accuracy scores. Here, A is for Actual and P is for Predicted. As an example, CM_A0P0_cases means that for the results on the cases data the Actual label is 0 and the Predicted label is 0; CM_A0P1_cases means that the Actual label is 0 and the Predicted label is 1; and so on. From these results it can be seen that the optimized model performed very well for both cases and deaths from Database 1 and Database 2.
The appropriate predictive value is precision. This is because precision answers the question: "The test for COVID-19 came back positive. How likely is it that the test is correct?" Here precision is defined as:
precision = (True Positives) / (True Positives + False Positives)
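These quantities can be reproduced with sklearn.metrics, where y_test and y_pred are the held-out labels and predictions from the split sketched above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Rows of the confusion matrix are Actual (A0, A1) and columns are
# Predicted (P0, P1), matching the CM_A*P* column names above.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)  # TP / (TP + FP)
print(cm, acc, prec)
```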
- Machine learning model performance optimization graph
- Visualizations using feature importance measures
The results of the random forest model can be found below:
Optimal model csv format files:
- rfinput_optimal.csv
- mlsetstat_optimal.csv
- rfimportance_optimal.csv
- rfresult_optimal.csv
Analysis csv format files:
- mlinputs_1.csv (C1 and D1 results)
- mlinputs2.csv (C2 and D2 results)
- analysis_output.csv (C1, D1, C2, and D2 results)
https://scikit-learn.org/stable/modules/ensemble.html?highlight=min_samples_split#random-forest-parameters
https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_iris.html
Module 18.7.8.1