Evaluation Metrics
Modeling problems that involve time series of varying scales make model evaluation difficult. In this section we review three classes of commonly used metrics and explain why we avoid them in our use case.
- Scale-dependent measures
- Examples: MSE, RMSE, MAE, and MdAE
- Advantage: These metrics are highly interpretable as they emphasize the difference between actual and fitted values.
- Disadvantages: The difference between actual and fitted values is highly influenced by the scale of any particular time series. For example, river flows in a small creek will always be smaller than a major river, therefore the scale of residual errors will depend on the particular time series in question. This means these metrics lose any meaningful interpretation once results from multiple time series are aggregated together.
- Recommendation: Should not be used to evaluate such problems, as the scale of the errors depends on the scale of the time series.
- Percentage based measures
- Examples: MAPE, MdAPE, RMSPE, RMdSPE, sMAPE, and sMdAPE
- Advantage: Being scale-independent, these are frequently used to compare forecast performance across data sets.
- Disadvantages: Can be infinite or undefined if observed values can be 0. Additionally, they follow a heavily skewed distribution when observed values are close to 0. This skewness can be stabilized using transformations (such as logarithms).
- Recommendation: If observations of 0 occur frequently it is impossible to use these measures without artificially restricting the training data to positive data.
- Benchmark Approaches
- Examples: R-squared, Nash-Sutcliffe Efficiency (NSE)
- Advantages: Both are well known, R-squared in the statistical literature and NSE in hydrology specifically.
- Review: The coefficient of determination (or R-squared) is a classical evaluation method that can be understood as the ratio between how good our model is versus a simple model that always predicts the average of all values. (Basically, it is comparing the fit of the chosen model with that of a horizontal straight line equal to the mean). A value close to 1 indicates our model has nearly zero error, a value close to 0 means our model is about as good as the mean baseline, and a value less than 0 indicates our model is worse than the straight horizontal line at modeling the data.
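As a rough illustration of the three classes above, the sketch below scores the same relative error on a small creek and on a river 1000 times larger. The function names and flow values are invented for the example. The `r_squared` helper uses the mean-baseline form described in the review, 1 - SS_res / SS_tot; NSE is computed with the same formula on observed and simulated flows.

```python
from statistics import fmean


def mae(obs, sim):
    """Mean absolute error (scale-dependent)."""
    return fmean(abs(o - s) for o, s in zip(obs, sim))


def mape(obs, sim):
    """Mean absolute percentage error; fails when an observation is 0."""
    return 100 * fmean(abs((o - s) / o) for o, s in zip(obs, sim))


def r_squared(obs, sim):
    """1 - SS_res / SS_tot, i.e. skill relative to always predicting the mean."""
    mean_obs = fmean(obs)
    ss_res = sum((o - s) ** 2 for o, s in zip(obs, sim))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - ss_res / ss_tot


# Hypothetical flows: every simulated value is off by 10%.
creek_obs = [1.0, 2.0, 4.0, 3.0]
creek_sim = [1.1, 1.8, 4.4, 3.3]
river_obs = [v * 1000 for v in creek_obs]
river_sim = [v * 1000 for v in creek_sim]

mae_creek, mae_river = mae(creek_obs, creek_sim), mae(river_obs, river_sim)
mape_creek, mape_river = mape(creek_obs, creek_sim), mape(river_obs, river_sim)
r2_creek = r_squared(creek_obs, creek_sim)
r2_mean_baseline = r_squared(creek_obs, [fmean(creek_obs)] * len(creek_obs))

print("MAE  creek vs river:", mae_creek, mae_river)    # differ by a factor of about 1000
print("MAPE creek vs river:", mape_creek, mape_river)  # both about 10 percent
print("R-squared of model:", r2_creek, "of mean baseline:", r2_mean_baseline)

# A single zero observation makes MAPE undefined.
try:
    mape([0.0, 2.0], [0.1, 1.8])
    mape_defined_at_zero = True
except ZeroDivisionError:
    mape_defined_at_zero = False
```

Averaging the two MAE values would be dominated entirely by the river, which is the aggregation problem described above.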
Disadvantages of R-squared:
- Problem 1: This measure is intended for evaluating the performance of a linear model, and is meant to be considered in conjunction with residual plots.
- Problem 2: R-squared can be close to 1 even when the model is completely wrong.
- Problem 3: R-squared can be arbitrarily low even when the model is well-calibrated.
- Problem 4: R-squared cannot be compared across datasets. It can be shown that the exact same model can have radically different R-squared values on different data.
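Problems 3 and 4 can be reproduced with a small simulation. The sketch below is an illustration under assumed settings (the line y = 2x, unit-variance Gaussian noise, and the two input ranges are arbitrary choices): it scores the true data-generating model on two datasets that differ only in how widely the input varies.

```python
import random
from statistics import fmean


def r_squared(obs, sim):
    """1 - SS_res / SS_tot against the mean baseline."""
    mean_obs = fmean(obs)
    ss_res = sum((o - s) ** 2 for o, s in zip(obs, sim))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - ss_res / ss_tot


def r2_of_true_model(x_max, n=5000, seed=0):
    """Score the exact generating model y = 2x on inputs drawn from [0, x_max]."""
    rng = random.Random(seed)
    x = [rng.uniform(0, x_max) for _ in range(n)]
    y = [2 * xi + rng.gauss(0, 1) for xi in x]  # observations with noise
    y_hat = [2 * xi for xi in x]                # predictions of the true model
    return r_squared(y, y_hat)


r2_narrow = r2_of_true_model(x_max=1)   # inputs vary little
r2_wide = r2_of_true_model(x_max=10)    # inputs vary a lot

print("R-squared, narrow input range:", round(r2_narrow, 3))
print("R-squared, wide input range:  ", round(r2_wide, 3))
```

The model and the noise level are identical in both runs; only the variance of the observations changes, and R-squared changes with it.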
Disadvantages of NSE:
As stated by Alexmex on Stack Exchange, the largest disadvantage of NSE is that the differences between the observed and predicted values are calculated as squared values. As a result, larger values in a time series are strongly overestimated whereas lower values are neglected (Legates and McCabe, 1999). For the quantification of runoff predictions, this leads to an overestimation of model performance during peak flows and an underestimation during low-flow conditions. Similar to R-squared, the Nash-Sutcliffe efficiency is not very sensitive to systematic model over- or underprediction, especially during low-flow periods.
The NSE criterion consists of three components: correlation, bias, and a measure of variability. In order to maximize NSE, the variability has to be underestimated. Further, the bias is scaled by the standard deviation of the observed values, which complicates comparison between basins.
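The three components can be made concrete with a short numerical check. The sketch below assumes the commonly used decomposition NSE = 2·alpha·r - alpha² - beta_n², where r is the Pearson correlation, alpha is the ratio of simulated to observed standard deviation, and beta_n is the difference in means divided by the observed standard deviation. The symbol names and sample values are assumptions for illustration; population (divide-by-n) statistics are used so that the identity holds exactly.

```python
from statistics import fmean, pstdev


def nse(obs, sim):
    """Nash-Sutcliffe Efficiency computed directly from squared errors."""
    mean_obs = fmean(obs)
    ss_res = sum((o - s) ** 2 for o, s in zip(obs, sim))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - ss_res / ss_tot


def pearson_r(obs, sim):
    """Pearson correlation using population statistics."""
    mean_obs, mean_sim = fmean(obs), fmean(sim)
    cov = fmean((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim))
    return cov / (pstdev(obs) * pstdev(sim))


def nse_from_components(obs, sim):
    """NSE rebuilt from correlation, variability ratio, and scaled bias."""
    r = pearson_r(obs, sim)
    alpha = pstdev(sim) / pstdev(obs)                   # variability ratio
    beta_n = (fmean(sim) - fmean(obs)) / pstdev(obs)    # bias scaled by observed std
    return 2 * alpha * r - alpha ** 2 - beta_n ** 2


# Hypothetical observed and simulated flows.
obs = [1.0, 2.0, 4.0, 3.0, 6.0, 2.5]
sim = [1.5, 1.8, 3.0, 3.5, 5.0, 2.0]

nse_direct = nse(obs, sim)
nse_decomposed = nse_from_components(obs, sim)
print(nse_direct, nse_decomposed)  # the two values agree up to rounding
```

For a fixed correlation and zero bias, the expression 2·alpha·r - alpha² is largest at alpha = r. Since r is below 1 for any imperfect model, the NSE-optimal simulation has less variability than the observations, which is the underestimation noted above.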
Recommendation: Due to the inherent shortcomings outlined above, it appears that neither of these metrics is the optimal choice for evaluating model flow forecast performance. For additional details on the drawbacks of R-squared specifically, see these lecture notes; the section on R-squared is on page 17.
In this section we propose an alternative metric that can be used to evaluate river flow forecasts.
In the quintessential time series book Forecasting: Principles & Practice, Hyndman defines the Mean Absolute Scaled Error (MASE) in section 3.4, Evaluating Forecast Accuracy. In simplest terms, MASE can be understood as the ratio of the MAE of the desired model to the MAE of a benchmark model. MASE < 1 then indicates that the neural network outperforms the naive baseline, while MASE > 1 implies that the errors of the naive baseline are smaller than those of the network.
Advantages of MASE:
- Symmetrical and resistant to outliers
- Because the numerator and denominator both involve values on the scale of the original data, the quotient is by design independent of the scale of the data.
- We can aggregate MASE statistics across thousands of rivers and still maintain a meaningful interpretation of model performance, in aggregate.
Disadvantages of MASE:
- Requires establishing a naive baseline (a random walk or mean forecast)
- Division by zero occurs in the trivial case where the input time series is constant (i.e., all values are equal)
- Assumes MAE is a relevant foundation on which to compute performance
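A minimal sketch of MASE with a random-walk baseline follows. It makes a simplifying assumption: the baseline MAE is computed on the same series being evaluated, whereas Hyndman's definition scales by the naive forecast's MAE on the training data. The function names and flow values are hypothetical.

```python
from statistics import fmean


def mae(obs, sim):
    """Mean absolute error."""
    return fmean(abs(o - s) for o, s in zip(obs, sim))


def mase(obs, sim):
    """MAE of the model divided by MAE of a random-walk (previous value) forecast.

    The first time step is skipped because the naive forecast has no
    previous observation to use there.
    """
    model_mae = mae(obs[1:], sim[1:])
    naive_mae = mae(obs[1:], obs[:-1])  # predict each value with the one before it
    if naive_mae == 0:
        raise ValueError("constant series: naive baseline MAE is zero")
    return model_mae / naive_mae


# Hypothetical creek flows and the same flows scaled up to a large river.
creek_obs = [10.0, 12.0, 11.0, 13.0, 12.0]
creek_sim = [10.0, 11.5, 11.5, 12.5, 12.5]
river_obs = [v * 1000 for v in creek_obs]
river_sim = [v * 1000 for v in creek_sim]

mase_creek = mase(creek_obs, creek_sim)
mase_river = mase(river_obs, river_sim)
print(mase_creek, mase_river)  # same value for both scales, below 1

# A constant series triggers the division-by-zero case listed above.
try:
    mase([5.0, 5.0, 5.0], [5.0, 5.1, 4.9])
    constant_series_ok = True
except ValueError:
    constant_series_ok = False
```

Because the creek and the river produce the same MASE, the values can be averaged across rivers without the largest basin dominating the result.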
KGE stands for Kling-Gupta Efficiency and is formulated by computing the Euclidean distance of the three NSE components from the ideal point. This avoids the problems associated with NSE (but also introduces new problems).
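The distance formulation can be sketched as follows, assuming the common form KGE = 1 - sqrt((r - 1)² + (alpha - 1)² + (beta - 1)²) with the ideal point at r = alpha = beta = 1. Here alpha is the ratio of standard deviations and beta is taken as the ratio of simulated to observed means, which is an assumption of this sketch and differs from the standard-deviation-scaled bias in the NSE decomposition. The sample flows are hypothetical.

```python
from math import sqrt
from statistics import fmean, pstdev


def pearson_r(obs, sim):
    """Pearson correlation using population statistics."""
    mean_obs, mean_sim = fmean(obs), fmean(sim)
    cov = fmean((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim))
    return cov / (pstdev(obs) * pstdev(sim))


def kge(obs, sim):
    """1 minus the Euclidean distance of (r, alpha, beta) from the ideal point (1, 1, 1)."""
    r = pearson_r(obs, sim)
    alpha = pstdev(sim) / pstdev(obs)   # variability ratio
    beta = fmean(sim) / fmean(obs)      # bias ratio (assumed form)
    return 1 - sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


obs = [1.0, 2.0, 4.0, 3.0]

kge_perfect = kge(obs, obs)                    # all three components equal 1
kge_doubled = kge(obs, [2 * v for v in obs])   # r = 1, alpha = 2, beta = 2
print(kge_perfect, kge_doubled)
```

In the second case the simulation is perfectly correlated with the observations but doubles both the mean and the spread, so only the variability and bias terms contribute to the distance.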