The random forest model uses bootstrap samples and an ensemble of decision trees for learning.
- What features contribute most to being classified as above or below average of the year 2020 (average of total cases and of total deaths)?
- What features contribute most to being classified as above or below average of years 2020 and 2021 (average of total cases and of total deaths)?
- What are the differences?
- If the number of features is reduced, what do the differences look like?
- Provides feature importances
- Variance is reduced by the randomness of bootstrap sampling (sampling with replacement), which improves the model's fit
- The datasets are sufficiently large for this approach
- Robust to outliers and nonlinear data
- Relatively short computation time is also a benefit
- It may not always be possible to ascertain why a random forest makes a given decision. Although sklearn provides feature importances, their limitations should be considered (see the sketch after this list):
"The impurity-based feature importances computed on tree-based models suffer from two flaws that can lead to misleading conclusions. First they are computed on statistics derived from the training dataset and therefore do not necessarily inform us on which features are most important to make good predictions on held-out dataset. Secondly, they favor high cardinality features, that is features with many unique values." (sklearn)
- A slight increase in bias may sometimes occur
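For exactly these reasons, the sklearn documentation recommends permutation importance computed on held-out data as an alternative. A minimal sketch, assuming a fitted classifier `rf_model` and a DataFrame test split (both placeholder names):

```python
from sklearn.inspection import permutation_importance

# Permutation importance on held-out data avoids both flaws quoted above;
# rf_model, X_test, and y_test are placeholders for the fitted classifier
# and the test split created later in this document.
result = permutation_importance(rf_model, X_test, y_test,
                                n_repeats=10, random_state=78)

# Features sorted from most to least important
for idx in result.importances_mean.argsort()[::-1]:
    print(X_test.columns[idx], result.importances_mean[idx])
```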
Database: United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv
- Explore new cases and new deaths from COVID-19 by state over the years 2020 and 2021
- Also use the yearly statistics from this database in the analysis of features in the vax_cases_death.csv database
- Database features are the new cases, the new deaths, and the states
- Label columns are based on the means for the year 2020 and the means for the years 2020 and 2021 combined
Database: vax_cases_death.csv
- Explore the distributions and administrations of vaccines over the year 2021 by state
- Possibly explore additional features
- Database features are the vaccine distributions, the vaccine administrations, and the states
- Label columns are based on the means for the year 2020 and the means for the years 2020 and 2021 combined (from the United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv database)
Imbalance: Note that the classifier label set is NOT imbalanced. This was verified by doing a value count on the label columns, as in the sketch below.
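A check along the following lines, where `df` is a placeholder for the preprocessed DataFrame with the label columns C1, D1, C2, and D2 (described further below) already attached:

```python
# Sketch: a value count on each label column to confirm the two classes
# are roughly balanced; df is the preprocessed DataFrame with labels.
for col in ["C1", "D1", "C2", "D2"]:
    print(df[col].value_counts(normalize=True))
```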
According to sklearn, the scikit-learn implementation differs from the original conceptualization of random forests, in which each decision tree, built on a bootstrap sample, votes for the best outcome; in scikit-learn the ensemble learners' predictions are instead averaged. Individual decision trees typically have high variance, which causes overfitting. Either the voting or the averaging algorithm tends to reduce the variance, so that the random forest is less likely to overfit the data. By assessing the model results, parameters can be found which provide a predictive fit rather than an overfit. The model parameters explored to find an optimal model are as follows (a construction sketch follows the list):
- n_estimators = 128
- random_state = 78
- criterion = 'gini' or 'entropy'
- max_depth = None or 10
- max_features = 'auto' or 'sqrt'
- min_impurity_decrease = 0.0 or a fraction
- oob_score = False or True
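For reference, a minimal sketch of one explored configuration (notebook 6 in the table below). Note that recent scikit-learn releases have removed max_features='auto' for classifiers; 'sqrt' is the equivalent setting:

```python
from sklearn.ensemble import RandomForestClassifier

# Configuration corresponding to notebook 6 in the table below.
rf_model = RandomForestClassifier(
    n_estimators=128,
    random_state=78,
    criterion="gini",
    max_depth=10,
    max_features="sqrt",
    min_impurity_decrease=0.0,
    oob_score=True,
)
```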
Table: Parameter Evaluations to Find an Optimal Model
notebook | n_estimators | random_state | criterion | max_depth | max_features | min_impurity_decrease | oob_score |
---|---|---|---|---|---|---|---|
1 | 128 | 78 | gini | None | 'auto' | 0.0 | False |
2 | 128 | 78 | gini | None | 'auto' | 0.0 | True |
3 | 128 | 78 | entropy | None | 'auto' | 0.0 | False |
4 | 128 | 78 | entropy | None | 'auto' | 0.0 | True |
5 | 128 | 78 | gini | 10 | 'sqrt' | 0.0 | False |
6 | 128 | 78 | gini | 10 | 'sqrt' | 0.0 | True |
7 | 128 | 78 | entropy | 10 | 'sqrt' | 0.0 | False |
8 | 128 | 78 | entropy | 10 | 'sqrt' | 0.0 | True |
9 | 128 | 78 | gini | None | 'sqrt' | 0.02 | False |
10 | 128 | 78 | gini | None | 'sqrt' | 0.02 | True |
11 | 128 | 78 | entropy | None | 'sqrt' | 0.02 | False |
12 | 128 | 78 | entropy | None | 'sqrt' | 0.02 | True |
13 | 128 | 78 | entropy | 10 | 'sqrt' | 0.0 | True |
14 | 128 | 78 | gini | None | 'sqrt' | 0.5 | False |
15 | 128 | 78 | gini | None | 'sqrt' | 0.5 | True |
The approach is to create label columns with binary outcomes from the total cases and the total deaths and then to use a random forest classifier to analyze the features. Here the binary outcome is whether the number of cases or deaths is above average or below average, determined by:
- First, evaluating the mean over 2020, found from the United_States_COVID-19_Cases_and_Deaths_by_State_over_Time database (Database 1).
- Second, evaluating the mean over 2020 and 2021, found from the same database.
These two means are used for feature analyses with both the United_States_COVID-19_Cases_and_Deaths_by_State_over_Time database and the vax_cases_death.csv database. The following table illustrates the four label columns created:
Table: Label Column Outcomes
 | Database 1 | Database 2 |
---|---|---|
Cases | C1 | C2 |
Deaths | D1 | D2 |
Database 1: United_States_COVID-19_Cases_and_Deaths_by_State_over_Time database
Database 2: vax_cases_death database
C1: at or above (>=) or below (<) the mean number of cases for 2020
C2: at or above (>=) or below (<) the mean number of cases for 2020 and 2021
D1: at or above (>=) or below (<) the mean number of deaths for 2020
D2: at or above (>=) or below (<) the mean number of deaths for 2020 and 2021
The features used in the evaluation from Database 1 and Database 2 were described in the section on setting up the analysis. The data preprocessing consisted of cleaning the data and creating the label columns, sketched below.
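A minimal sketch of the label creation, assuming CDC-style column names (submission_date, tot_cases) for the cleaned Database 1; D1 and D2 follow the same pattern using the total-deaths column:

```python
import pandas as pd

# Sketch of the label-creation step; column names are assumptions about
# the cleaned Database 1.
df = pd.read_csv(
    "United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv",
    parse_dates=["submission_date"],
)

year = df["submission_date"].dt.year
mean_2020 = df.loc[year == 2020, "tot_cases"].mean()
mean_2020_21 = df.loc[year.isin([2020, 2021]), "tot_cases"].mean()

# Label is 1 when at or above the mean, 0 when below
df["C1"] = (df["tot_cases"] >= mean_2020).astype(int)
df["C2"] = (df["tot_cases"] >= mean_2020_21).astype(int)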
Splitting the data into training and testing sets for random forest modeling used the default values in scikit-learn, as in the sketch below. The optimized model (notebook 6 above) was found by training and testing on Database 1 in this manner; the same configuration was then used to learn on Database 2.
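With the scikit-learn defaults this amounts to the following, where X and y are placeholders for the feature matrix and one of the label columns:

```python
from sklearn.model_selection import train_test_split

# Default split is 75% train / 25% test; random_state pins the shuffle.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

rf_model.fit(X_train, y_train)   # rf_model as constructed earlier
y_pred = rf_model.predict(X_test)
```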
The model results, for both the optimal and the non-optimal models, are loaded into a PostgreSQL database to help with analysis and presentation (a loading sketch follows). The model results database has tables holding the model input, statistics for the numbers of cases and deaths, the random forest feature importances, and other model results used in the machine learning.
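One way to load the results, assuming SQLAlchemy and a local PostgreSQL instance; the connection string, table name, and `importances_df` are illustrative:

```python
from sqlalchemy import create_engine

# Illustrative connection string; one table per kind of result (model
# input, case/death statistics, feature importances, other results).
engine = create_engine("postgresql://user:password@localhost:5432/model_results")

# importances_df is a placeholder DataFrame of feature importances
importances_df.to_sql("rfimportance", engine, if_exists="replace", index=False)
```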
Some results for the optimized model are given in the following table:
 | CM_A0P0_cases | CM_A0P1_cases | CM_A1P0_cases | CM_A1P1_cases | CM_A0P0_death | CM_A0P1_death | CM_A1P0_death | CM_A1P1_death | acc_score_cases | acc_score_death |
---|---|---|---|---|---|---|---|---|---|---|
Database 1 | 1448 | 663 | 360 | 2096 | 1448 | 663 | 360 | 2096 | 0.776001752 | 0.776001752 |
Database 1 | 2770 | 246 | 298 | 1253 | 2496 | 295 | 587 | 1189 | 0.880884607 | 0.806875411 |
Database 2 | 71 | 11 | 0 | 516 | 170 | 10 | 3 | 415 | 0.981605351 | 0.97826087 |
Database 2 | 245 | 4 | 21 | 328 | 271 | 9 | 38 | 280 | 0.95819398 | 0.921404682 |
These are the entries of the confusion matrix together with the accuracy scores. Here, A is for Actual and P is for Predicted. As an example, CM_A0P0_cases means that for the results on the cases data the Actual label is 0 and the Predicted label is 0; CM_A0P1_cases means that the Actual label is 0 and the Predicted label is 1; and so on. From these results it can be seen that the optimized model performed very well for both cases and deaths from Database 1 and Database 2.
The appropriate predictive value is precision. This is because precision answers the question: "The test for COVID-19 came back positive. How likely is it that the test is correct?" Here precision is defined as:
precision = (True Positives) / (True Positives + False Positives)
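These quantities can be reproduced with sklearn.metrics, where y_test and y_pred are the held-out labels and predictions from the split sketched above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Rows of the confusion matrix are Actual (A0, A1) and columns are
# Predicted (P0, P1), matching the CM_A*P* column names above.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)  # TP / (TP + FP)
print(cm, acc, prec)
```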
- Machine learning model performance optimization graph
- Visualizations using feature importance measures
The results of the random forest model can be found below:
Optimal model csv format files:
- rfinput_optimal.csv
- mlsetstat_optimal.csv
- rfimportance_optimal.csv
- rfresult_optimal.csv
Analysis csv format files:
- mlinputs_1.csv (C1 and D1 results)
- mlinputs2.csv (C2 and D2 results)
- analysis_output.csv (C1, D1, C2, and D2 results)
https://scikit-learn.org/stable/modules/ensemble.html?highlight=min_samples_split#random-forest-parameters
https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_iris.html
Module 18.7.8.1