This repository includes the Python code of four models used to predict the water temperature of 83 rivers with limiting forcing data (98% missing data). The results of this study are described in the following manuscript: Almeida, M.C. and Coelho, P.S.: Modeling river water temperature with limiting forcing data: air2stream v1.0.0, machine learning and multiple regression. The four models are:
- Random Forest (see the scikit-learn documentation);
- Artificial Neural Network with the Momentum algorithm (see the NeuPy documentation);
- Support Vector Regression (see the scikit-learn documentation);
- Multiple Regression (see the scikit-learn documentation).
We have also included the hybrid air2stream model (see Toffolon and Piccolroaz, 2015). This benchmark model was used to make the results comparable with those of other studies.
The machine learning models' hyperparameters were optimized with the Tree-structured Parzen Estimator (TPE) algorithm (Bergstra et al., 2011). The Python implementation of TPE with the Hyperopt library (Bergstra et al., 2013) is also available.
The raw training datasets were modified with an under/oversampling technique: 100 different training datasets were derived for each station from the initial dataset by applying the Synthetic Minority Over-sampling Technique for regression with Gaussian Noise (SMOGN) (Branco et al., 2017). The Python implementation of SMOGN is also available; this code applies the TPE algorithm and SMOGN, and runs a random forest regressor.
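The idea behind this resampling step can be illustrated with a small sketch: rare target values are oversampled by duplicating them with Gaussian noise. This is only a conceptual illustration on synthetic data, not the actual SMOGN implementation (which is available from PyPI):

```python
# Conceptual illustration of oversampling rare target values with
# Gaussian noise (the core idea behind SMOGN's oversampling branch).
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"air_temp": rng.normal(15, 8, 300)})
df["water_temp"] = 0.8 * df["air_temp"] + rng.normal(0, 1, 300)

# Treat samples in the tails of the target distribution as "rare"
lo, hi = df["water_temp"].quantile([0.05, 0.95])
rare = df[(df["water_temp"] < lo) | (df["water_temp"] > hi)]

# Oversample: replicate each rare row 3 times and perturb with small
# Gaussian noise scaled to each column's standard deviation
noise = rng.normal(scale=0.05 * df.std().values,
                   size=(len(rare) * 3, df.shape[1]))
synthetic = pd.DataFrame(np.repeat(rare.values, 3, axis=0) + noise,
                         columns=df.columns)
augmented = pd.concat([df, synthetic], ignore_index=True)
print(len(df), len(rare), len(augmented))
```

Repeating this with different noise seeds and resampling parameters yields many distinct training datasets from one original, which is how the 100 per-station datasets are produced.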
Additionally, we have included the Python code used to quantify feature importance with a random forest regressor (see the scikit-learn documentation). The random forest regressor with the following parameters was the best performing model for stations with 98% missing data (see Almeida and Coelho, 2022): n_estimators = 50, max_depth = 485, min_samples_split = 5, max_features = 'auto', bootstrap = True.
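A sketch of this feature-importance computation, using the parameter set quoted above on synthetic data (note that `max_features='auto'`, the pre-1.3 scikit-learn default, is omitted here for compatibility with newer scikit-learn versions):

```python
# Feature importance from a random forest's impurity decrease
# (feature_importances_), with the best-performing parameter set above.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# The target depends mostly on the first feature
y = 3.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(
    n_estimators=50, max_depth=485, min_samples_split=5,
    bootstrap=True, random_state=0,
)
model.fit(X, y)
importances = model.feature_importances_  # sums to 1
print(importances)
```

The per-feature importances can then be collected across all station files and written to a CSV, as the repository's feature-importance script does.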
In the folder Input data we have included 83 input files. Each file contains the following nine columns:
- Date (e.g. 10/24/1988 12:00:00 AM);
- Observed water temperature (°C);
- Mean daily air temperature (°C);
- Discharge (m³ s⁻¹);
- Mean daily global radiation (J m⁻²);
- Maximum daily air temperature (°C);
- Minimum daily air temperature (°C);
- Month of the year (e.g. 1, 2, 3, ..., 12);
- Day of the year (e.g. 1, 2, 3, ..., 365).
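A small sketch of a table with these nine columns; the column names and values are illustrative (the actual `.xlsx` files may use different headers), and the two calendar predictors are derived from the date:

```python
# Assembling the nine input columns described above (synthetic values).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("1988-10-24", periods=5, freq="D")
df = pd.DataFrame({
    "date": dates,                                   # Date
    "water_temp_C": rng.uniform(5, 20, 5),           # observed water temp.
    "air_temp_mean_C": rng.uniform(0, 25, 5),        # mean daily air temp.
    "discharge_m3s": rng.uniform(1, 100, 5),         # discharge
    "global_radiation_Jm2": rng.uniform(1e6, 3e7, 5),  # global radiation
    "air_temp_max_C": rng.uniform(10, 30, 5),        # max daily air temp.
    "air_temp_min_C": rng.uniform(-5, 10, 5),        # min daily air temp.
})
# Calendar predictors derived from the date column
df["month"] = df["date"].dt.month
df["day_of_year"] = df["date"].dt.dayofyear
print(df.shape)
```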
The model parameters are easy to locate in the code. Nonetheless, the following table lists the model parameters that are optimized with the TPE algorithm.
Model | Prior distribution | Parameter | Optimization range |
---|---|---|---|
Random Forest | uniform | 'n_estimators' | [50, 2000] |
Random Forest | uniform | 'max_depth' | [10, 1000] |
Random Forest | uniform | 'min_samples_split' | [2, 10] |
Random Forest | - | 'max_features' | [auto, sqrt] |
Random Forest | - | 'bootstrap' | [True, False] |
ANN | categorical | 'n_layers' | [1, 2] |
ANN | uniform integer | 'n_units_layer' | [10, 50] |
ANN | categorical | 'act_func_type' | ['Relu', 'PRelu', 'Elu', 'Tanh', 'Sigmoid'] |
ANN | categorical | 'regularization' | [True, False] |
ANN | quantized distribution | 'n_epochs' | With regularization: [500, 1000]; without regularization: [20, 300] |
ANN | uniform | 'dropout' | [0, 1.0] |
ANN | loguniform | 'batch_size' | [5, 20] |
ANN | uniform | 'initial_value' | [0.001, 0.1] |
ANN | uniform | 'reduction_freq' | [10, 200] |
ANN | uniform | 'decay_rate' (regularization) | [0.0001, 0.001] |
SVR | categorical | 'C' | [0.1, 1, 100, 1000] |
SVR | categorical | 'kernel' | ['rbf', 'poly', 'sigmoid', 'linear'] |
SVR | categorical | 'degree' | [1, 2, 3, 4, 5, 6] |
SVR | categorical | 'gamma' | [1, 0.1, 0.01, 0.001, 0.0001] |
SVR | categorical | 'epsilon' | [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10] |
- Install neupy from the neupy webpage;
- Create an empty folder;
- In this folder, place the Python code file (e.g. Hyper_ANN.py) and the input file (e.g. st1.xlsx). In the code file (e.g. Hyper_ANN.py), set the training and validation percentages of the dataset (e.g. train_size=0.7, test_size=0.3);
- Run the code. The output includes: a file with the score for each model run; a file with the parameters for each model run; a file with the Mean Absolute Error (MAE) for the training dataset; and a file with the MAE for the validation dataset.
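The 70/30 split and the MAE reporting in the steps above can be sketched as follows, with synthetic data and a placeholder linear model standing in for the ANN:

```python
# 70/30 train/validation split and MAE computation, as in the steps
# above (placeholder model and synthetic data, for illustration only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
mae_train = mean_absolute_error(y_train, model.predict(X_train))
mae_test = mean_absolute_error(y_test, model.predict(X_test))
print(mae_train, mae_test)
```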
- Create an empty folder;
- In this folder, place the Python code file (e.g. ANN.py) and the input file or files (e.g. st1.xlsx; st2.xlsx; st3.xlsx; ...; st100.xlsx). In the code file (e.g. ANN.py), set the training and validation percentages of the dataset (e.g. train_size=0.7, test_size=0.3) and replace the model parameters with the values obtained in the hyperparameter optimization step above;
- Run the code. The output includes a file with the predicted values for the training dataset (e.g. 1-st1.xlsxtrain.xlsx) and a file with the predicted values for the testing dataset (e.g. 2-st1.xlsxtest.xlsx).
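The repository's ANN uses NeuPy's Momentum algorithm; as a stand-in sketch of the same fit/predict workflow with fixed (previously optimized) hyperparameters, scikit-learn's MLPRegressor can be used. The layer size, activation, and epoch count below are illustrative, not the study's optimized values:

```python
# Fit/predict workflow with fixed hyperparameters; MLPRegressor is a
# stand-in for NeuPy's Momentum network (illustrative values only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=150)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, test_size=0.3, random_state=0)

model = MLPRegressor(hidden_layer_sizes=(30,), activation="relu",
                     max_iter=500, random_state=0)
model.fit(X_train, y_train)
pred_train = model.predict(X_train)  # values written to the *train file
pred_test = model.predict(X_test)    # values written to the *test file
print(pred_train.shape, pred_test.shape)
```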
- Install SMOGN from https://pypi.org/project/smogn/;
- Create an empty folder;
- In this folder, place the Python code file (e.g. Random_forest_Hyperopt_SMOGN.py) and the input file (e.g. st46.xlsx). In the code file (e.g. Random_forest_Hyperopt_SMOGN.py), set the training and validation percentages of the dataset (e.g. train_size=0.7, test_size=0.3);
- Run the code. The output includes: a file with the modified training dataset (e.g. st46.xlsxSMOGN_out0.xlsx); a file with the SMOGN parameters for the 100 modified training datasets (st46.xlsxSMOGN_parameters_out99.xlsx); a file with the parameters for the ML model run (st46.xlsxparameters0.csv); a file with the Mean Absolute Error (MAE) and the Nash–Sutcliffe model efficiency coefficient (NSE) for the 100 modified training datasets (A-st46.xlsxmodel_out99.xlsx); a file with the predicted values for the training dataset (st46.xlsxtrain.xlsx); and a file with the predicted values for the testing dataset (st46.xlsxtest.xlsx).
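The two skill metrics written out by this pipeline can be computed as follows (standard definitions; the example values are arbitrary):

```python
# Mean Absolute Error (MAE) and Nash-Sutcliffe efficiency (NSE):
#   MAE = mean(|obs - pred|)
#   NSE = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
import numpy as np

def mae(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return np.mean(np.abs(obs - pred))

def nse(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

obs = np.array([10.0, 12.0, 14.0, 16.0])
pred = np.array([10.5, 11.5, 14.5, 15.5])
print(mae(obs, pred), nse(obs, pred))  # 0.5 and 0.95
```

NSE equals 1 for a perfect model and drops below 0 when the model predicts worse than the observed mean, which is why it is a common benchmark metric for river temperature models.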
- Create an empty folder;
- In this folder, place the Python code file (Random Forest_Feature_importance.py) and the input files (e.g. st1.xlsx; st2.xlsx; st3.xlsx; ...; st100.xlsx). In the code file (Random Forest_Feature_importance.py), set the training and validation percentages of the dataset (e.g. train_size=0.7, test_size=0.3) and change the path to the output file (importance.csv).
Almeida, M.C. and Coelho P.S.: Modeling river water temperature with limiting forcing data,...
Bergstra, J. S., Bardenet, R., Bengio, Y. and Kegl, B.: Algorithms for hyper-parameter optimization, Advances in Neural Information Processing Systems, 2546–2554, 2011.
Bergstra, J., Yamins, D., Cox, D. D.: Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures, Proc. of the 30th International Conference on Machine Learning (ICML 2013), 115–123, 2013.
Branco, P., Ribeiro, R. P., Torgo, L., Krawczyk, B., Moniz, N.: SMOGN: a pre-processing approach for imbalanced regression, Proceedings of Machine Learning Research 74, 36–50, 2017.
Toffolon, M. and Piccolroaz, S.: A hybrid model for river water temperature as a function of air temperature and discharge, Journal of Hydrology 529, 302–315, https://doi.org/10.1016/j.jhydrol.2015.07.044, 2015.