Please save the data file as CSV into the data folder ('data\Final_Africa_Master_File 2.csv'). It is not part of the git repository due to its large size.
Data preparation (notebook)
Contributors: Jan and Kea
- Included only the columns Radwa said we should include: "Take into consideration: Sow_Maize, Harvest_Maize, Sand (1-7), Clay (1-7), OC (1-7), PAW (1-7), pcp, tmax, tmin, spi, and Y_maize_major. We should predict next year's crop yield."
- Dropped rows where the target Y_maize_major is NA.
- For 6 countries, Sow_Maize and Harvest_Maize were NA. Dropped those countries; 30 countries remained.
- Created a Farm column from the latitude and longitude columns, so e.g. the Farm value "Farm_110_Angola" means farm number 110 in Angola. Dropped the latitude and longitude columns afterwards (see the sketch after this list).
- From Sow_Maize and Harvest_Maize extracted the month, encoded it, and dropped the original columns.
- Added the time between sowing and harvest.
- Calculated the mean of the pcp, tmax, tmin, and spi columns.
- Created lagged values of the dependent variable and of the mean pcp, tmax, tmin, and spi for the preceding 3 years.
- Dropped the same-year pcp, tmax, tmin, and spi columns, because this information is not available at prediction time.
- Kept only the latest 10 years of data, i.e. 2007-2016.
The final dataframe contains 32,359 rows: 30 countries, 3,887 farms, and 10 years of data.
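A minimal pandas sketch of these steps, assuming column names such as Year and Country and date-parsable Sow_Maize/Harvest_Maize values; the actual notebook may differ in details such as the farm numbering scheme and the month encoding:

```python
import pandas as pd

df = pd.read_csv("data/Final_Africa_Master_File 2.csv")

# Farm identifier from coordinates, e.g. "Farm_110_Angola"
# (illustrative numbering; the real scheme numbers farms within each country)
farm_ids = df.groupby(["Country", "latitude", "longitude"], sort=False).ngroup() + 1
df["Farm"] = "Farm_" + farm_ids.astype(str) + "_" + df["Country"]
df = df.drop(columns=["latitude", "longitude"])

# Month of sowing/harvest and growing duration; months kept as plain integers here
sow = pd.to_datetime(df["Sow_Maize"])
harvest = pd.to_datetime(df["Harvest_Maize"])
df["Sow_Month"] = sow.dt.month
df["Harvest_Month"] = harvest.dt.month
df["Growing_Days"] = (harvest - sow).dt.days
df = df.drop(columns=["Sow_Maize", "Harvest_Maize"])

# Lagged target and yearly weather means for the preceding 3 years, per farm
df = df.sort_values(["Farm", "Year"])
for col in ["Y_maize_major", "pcp", "tmax", "tmin", "spi"]:
    for lag in (1, 2, 3):
        df[f"{col}_lag{lag}"] = df.groupby("Farm")[col].shift(lag)

# Same-year weather is unknown at prediction time, so drop it
df = df.drop(columns=["pcp", "tmax", "tmin", "spi"])
df = df[df["Year"].between(2007, 2016)]
```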
Baseline selection (notebook)
Contributors: Kea and Jan
- Dropped the Country and Farm variables. We also tried one-hot encoding them, but it increased the runtime several times over.
- Made a train-test split: the train split got the years 2007-2015, the test split got the year 2016.
- From the train split created 5 time-series cross-validation splits.
- As regressors for baseline selection we took KNN, Random Forest, AdaBoost, Linear Regression, and LightGBM. The first three cannot extrapolate beyond the training range; the last two can.
- Used MinMaxScaler.
- Based on cross-validation, Linear Regression had the best performance in both RMSE and MAE.
- Refitted Linear Regression on the train split and predicted on the test split. Performance on the test set (year 2016), i.e. the baseline results: RMSE 0.33, MAE 0.213. (A minimal sketch of this setup follows.)
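A minimal sketch of the baseline selection, building on the data preparation sketch above; the model settings are defaults, not the exact configurations used:

```python
from sklearn.model_selection import TimeSeriesSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from lightgbm import LGBMRegressor

# Sort by year so TimeSeriesSplit respects chronological order
train = df[df["Year"] <= 2015].sort_values("Year")
test = df[df["Year"] == 2016]
drop_cols = ["Y_maize_major", "Country", "Farm"]
X_train, y_train = train.drop(columns=drop_cols), train["Y_maize_major"]
X_test, y_test = test.drop(columns=drop_cols), test["Y_maize_major"]

models = {
    "KNN": KNeighborsRegressor(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "AdaBoost": AdaBoostRegressor(random_state=42),
    "Linear Regression": LinearRegression(),
    "LightGBM": LGBMRegressor(random_state=42),
}

cv = TimeSeriesSplit(n_splits=5)
for name, model in models.items():
    scores = cross_validate(
        make_pipeline(MinMaxScaler(), model), X_train, y_train, cv=cv,
        scoring=("neg_root_mean_squared_error", "neg_mean_absolute_error"),
    )
    print(name,
          -scores["test_neg_root_mean_squared_error"].mean(),
          -scores["test_neg_mean_absolute_error"].mean())

# Refit the CV winner on the full train split and score on 2016
best = make_pipeline(MinMaxScaler(), LinearRegression()).fit(X_train, y_train)
pred = best.predict(X_test)
print(mean_squared_error(y_test, pred, squared=False),
      mean_absolute_error(y_test, pred))
```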
- We compare 5 AutoML frameworks against these baseline results: TPOT, AutoGluon, PyCaret, AutoKeras, and H2O.
- The notebooks are here:
- TPOT notebook by Jan.
- AutoGluon notebook by Andri.
- PyCaret notebook by Kea. The best model was a tuned HuberRegressor(alpha=0.01, epsilon=1.1).
- AutoKeras notebook and H2O notebook by Valerija. In H2O, the two best models, StackedEnsemble_BestOfFamily and StackedEnsemble_AllModels, were both compared, since the cross-validation results alone could have been misleading (a sketch of the H2O workflow follows this list).
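A hedged sketch of the H2O workflow using standard H2O AutoML calls; the run-time budget and the way the two top ensembles are fetched from the leaderboard are assumptions:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train_h2o = h2o.H2OFrame(train)  # pandas frames from the baseline sketch
test_h2o = h2o.H2OFrame(test)

aml = H2OAutoML(max_runtime_secs=3600, seed=1)  # illustrative budget
aml.train(y="Y_maize_major", training_frame=train_h2o)

# The two top ensembles usually head the leaderboard; because their CV
# ranking can be misleading, both are scored on the held-out 2016 data.
print(aml.leaderboard.head())
top_two = aml.leaderboard["model_id"].as_data_frame()["model_id"][:2]
for model_id in top_two:
    perf = h2o.get_model(model_id).model_performance(test_h2o)
    print(model_id, perf.rmse(), perf.mae())
```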
Comparison of AutoML frameworks by performance on the test set (year 2016).
| Approach details | Baseline | TPOT | AutoGluon | PyCaret | AutoKeras | H2O |
|---|---|---|---|---|---|---|
| Data preprocessing | MinMax Scaler | None | None | Robust Scaler | Automated | Automated |
| Model | Linear Regression | PCA + ElasticNetCV | ExtraTreesMSE | Tuned Huber Regressor | NN* | StackedEnsemble_BestOfFamily |
| RMSE | 0.33 | 0.2818 | 0.3069 | 0.3262 | 0.4081 | 0.334 |
| MAE | 0.213 | 0.1856 | 0.1901 | 0.2000 | 0.3361 | 0.2143 |
AutoKeras NN* - The structure of the model that AutoKeras chose is the following (a plain-Keras sketch follows this list):
- An input layer that takes in the features.
- A multi-category encoding layer, which likely handles the categorical variables.
- A dense layer with 32 neurons followed by a ReLU activation.
- Another dense layer with 16 neurons, also followed by a ReLU activation.
- A regression head (a single dense neuron with a linear activation), which is the output layer for regression tasks.
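A plain-Keras sketch of this architecture, assuming purely numeric inputs (the multi-category encoding layer is AutoKeras-internal and omitted here); layer sizes follow the description above, the other settings are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 25  # placeholder: number of columns in the feature matrix
model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1),  # regression head: one neuron, linear activation
])
model.compile(optimizer="adam", loss="mean_squared_error", metrics=["mae"])
model.summary()
```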
A Hyperopt (notebook) experiment on a limited subsample with a small number of iterations was also conducted by Andri on four of the baseline regressors. However, due to the high computational cost, it was not run in full over all the regressors on the full dataset with an optimal number of iterations, and is therefore not included in the final results. As a test, the MAE for linear regression on the full dataset with 200 iterations improved on the baseline metric, reaching the level of PyCaret.
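A minimal Hyperopt sketch under assumptions: the notebook's actual search spaces are unknown, and since plain LinearRegression has almost no hyperparameters, a Ridge penalty stands in as the tunable linear model here:

```python
from hyperopt import fmin, tpe, hp, Trials
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

cv = TimeSeriesSplit(n_splits=5)
space = {"alpha": hp.loguniform("alpha", -5, 2)}  # hypothetical search space

def objective(params):
    # Minimize mean CV MAE; X_train/y_train from the baseline sketch
    return -cross_val_score(
        Ridge(alpha=params["alpha"]), X_train, y_train,
        cv=cv, scoring="neg_mean_absolute_error",
    ).mean()

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=200, trials=trials)
print(best)
```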
The best model in terms of performance on the test set includes PCA, which makes interpretation difficult. We can examine which features had the most influence on each component and which components were most important as inputs to the model:
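For example, with a refit of the winning PCA + ElasticNetCV pipeline (the number of components is a placeholder), the PCA loadings show each feature's influence per component, and the ElasticNet coefficient magnitudes show each component's importance to the model:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import Pipeline

# Hypothetical refit of the winning pipeline; n_components is a placeholder
pipe = Pipeline([("pca", PCA(n_components=10)), ("enet", ElasticNetCV())])
pipe.fit(X_train, y_train)

# Feature influence per component: rows = components, columns = features
loadings = pd.DataFrame(
    pipe["pca"].components_,
    columns=X_train.columns,
    index=[f"PC{i + 1}" for i in range(pipe["pca"].n_components_)],
)
print(loadings.abs().idxmax(axis=1))  # most influential feature per component

# Component importance as model inputs: ElasticNet coefficient magnitudes
importance = pd.Series(np.abs(pipe["enet"].coef_), index=loadings.index)
print(importance.sort_values(ascending=False))
```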
Further on, however, we use explainability tools on a simple Random Forest model without PCA to gain more insight into the important features, their relationship with the dependent variable, and explanations for specific predictions.
Here is the full Interpretability notebook, which contains (a minimal sketch follows the list):
- Information value using Random Forest (top 10 features)
- Partial Dependence Plots
- LIME explanations of three instances
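A minimal sketch of these three steps; the lag feature names passed to the partial dependence plot are hypothetical, and the Random Forest settings are assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay
from lime.lime_tabular import LimeTabularExplainer

# X_train/y_train/X_test from the baseline sketch
rf = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# Top 10 feature importances from the Random Forest
print(pd.Series(rf.feature_importances_, index=X_train.columns).nlargest(10))

# Partial dependence for two illustrative (hypothetical) lag features
PartialDependenceDisplay.from_estimator(rf, X_train, features=["spi_lag1", "tmax_lag1"])
plt.show()

# LIME explanation of a single 2016 instance
explainer = LimeTabularExplainer(
    X_train.values, feature_names=list(X_train.columns), mode="regression"
)
explanation = explainer.explain_instance(X_test.values[0], rf.predict, num_features=10)
print(explanation.as_list())
```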