Skip to content

Commit

Permalink
fix ensemble to ensure appropriate weighting of forecast draws (#98)…
Browse files Browse the repository at this point in the history
…; add links to Youtube webinar series on Readme and Index; rebuild a few vignettes
  • Loading branch information
nicholasjclark committed Dec 16, 2024
1 parent bbd9d1e commit a23a51a
Show file tree
Hide file tree
Showing 60 changed files with 414 additions and 322 deletions.
1 change: 1 addition & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
* Arguments `use_stan`, `jags_path`, `data_train`, `data_test`, `adapt_delta`, `max_treedepth` and `drift` have been removed from primary functions to streamline documentation and reflect the package's mission to deprecate 'JAGS' as a suitable backend. Both `adapt_delta` and `max_treedepth` should now be supplied in a named `list()` to the new argument `control`

## Bug fixes
* Updates to ensure `ensemble` provides appropriate weighting of forecast draws (#98)
* Not necessarily a "bug fix", but this update removes several dependencies to lighten installation and improve efficiency of the workflow (#93)
* Fixed a minor bug in the way `trend_map` recognises levels of the `series` factor
* Bug fix to ensure `lfo_cv` recognises the actual times in `time`, just in case the user supplies data that doesn't start at `t = 1`. Also updated documentation to better reflect this
Expand Down
53 changes: 39 additions & 14 deletions R/ensemble.R
Original file line number Diff line number Diff line change
Expand Up @@ -102,31 +102,56 @@ ensemble.mvgam_forecast <- function(object, ..., ndraws = 5000){
# End of checks; now proceed with ensembling
n_series <- length(models[[1]]$series_names)

# Function to random sample rows of a matrix with
# replacement (in case some forecasts contain fewer draws than others)
subsamp <- function(x, nsamps){
if(NROW(x) < nsamps){
sampinds <- sample(1:NROW(x), nsamps, replace = TRUE)
} else {
sampinds <- sample(1:NROW(x), nsamps, replace = FALSE)
}

x[sampinds, ]
}
# Calculate total number of forecast draws to sample from for each model
n_mod_draws <- lapply(seq_len(n_models), function(x){
NROW(models[[x]]$forecasts[[1]])
})

# Create evenly weighted ensemble hindcasts and forecasts
# Calculate model weights (only option at the moment is even weighting,
# but this may be relaxed in future)
mod_weights <- data.frame(mod = paste0('mod', 1:n_models),
orig_weight = 1 / n_models,
ndraws = unlist(n_mod_draws,
use.names = FALSE)) %>%
# Adjust weights by the number of draws available per
# forecast, ensuring that models with fewer draws aren't
# under-represented in the final weighted ensemble
dplyr::mutate(weight = (orig_weight / ndraws) * 100) %>%
dplyr::mutate(mod = as.factor(mod)) %>%
dplyr::select(mod, weight)

# Create draw indices
mod_inds <- as.factor(unlist(lapply(seq_len(n_models), function(x){
rep(paste0('mod', x), NROW(models[[x]]$forecasts[[1]]))
}), use.names = FALSE))
all_draw_inds <- 1:sum(unlist(n_mod_draws,
use.names = FALSE))
mod_inds_draws <- split(all_draw_inds, mod_inds)

# Add model-specific weights to the draw indices
draw_weights <- data.frame(draw = all_draw_inds,
mod = mod_inds) %>%
dplyr::left_join(mod_weights, by = 'mod')

# Perform multinomial sampling using draw-specific weights
fc_draws <- sample(all_draw_inds,
size = ndraws,
replace = max(all_draw_inds) < ndraws,
prob = draw_weights$weight)

# Create weighted ensemble hindcasts and forecasts
ens_hcs <- lapply(seq_len(n_series), function(series){
all_hcs <- do.call(rbind,
lapply(models,
function(x) x$hindcasts[[series]]))
subsamp(all_hcs, ndraws)
all_hcs[fc_draws, ]
})

ens_fcs <- lapply(seq_len(n_series), function(series){
all_fcs <- do.call(rbind,
lapply(models,
function(x) x$forecasts[[series]]))
subsamp(all_fcs, ndraws)
all_fcs[fc_draws, ]
})

# Initiate the ensemble forecast object
Expand Down
2 changes: 1 addition & 1 deletion R/globals.R
Original file line number Diff line number Diff line change
Expand Up @@ -26,4 +26,4 @@ utils::globalVariables(c("y", "year", "smooth_vals", "smooth_num",
"value", "threshold", "colour", "resids",
"c_dark", "eval_timepoints", "yqlow",
"ymidlow", "ymidhigh", "yqhigh", "preds",
"yhigh", "ylow"))
"yhigh", "ylow", "weight", "orig_weight"))
3 changes: 2 additions & 1 deletion README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -34,8 +34,9 @@ knitr::opts_chunk$set(
The goal of `mvgam` is to fit Bayesian (Dynamic) Generalized Additive Models. This package constructs State-Space models that can include highly flexible nonlinear predictor effects for both process and observation components by leveraging functionalities from the impressive [`brms`](https://paulbuerkner.com/brms/){target="_blank"} and [`mgcv`](https://cran.r-project.org/web/packages/mgcv/index.html){target="_blank"} packages. This allows `mvgam` to fit a wide range of models, including hierarchical ecological models such as N-mixture or Joint Species Distribution models, as well as univariate and multivariate time series models with imperfect detection. The original motivation for the package is described in [Clark & Wells 2022](https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13974){target="_blank"} (published in *Methods in Ecology and Evolution*), with additional inspiration on the use of Bayesian probabilistic modelling coming from [Michael Betancourt](https://betanalpha.github.io/writing/){target="_blank"}, [Michael Dietze](https://www.bu.edu/earth/profiles/michael-dietze/){target="_blank"} and [Sarah Heaps](https://www.durham.ac.uk/staff/sarah-e-heaps/){target="_blank"}, among many others.

## Resources
A series of [vignettes cover data formatting, forecasting and several extended case studies of DGAMs](https://nicholasjclark.github.io/mvgam/){target="_blank"}. A number of other examples have also been compiled:
A series of [vignettes cover data formatting, forecasting and several extended case studies of DGAMs](https://nicholasjclark.github.io/mvgam/){target="_blank"}. A number of other examples, including some step-by-step introductory webinars, have also been compiled:

* [Time series in R and Stan using the `mvgam` package](https://www.youtube.com/playlist?list=PLzFHNoUxkCvsFIg6zqogylUfPpaxau_a3){target="_blank"}
* [Ecological Forecasting with Dynamic Generalized Additive Models](https://www.youtube.com/watch?v=0zZopLlomsQ){target="_blank"}
* [Distributed lags (and hierarchical distributed lags) using `mgcv` and `mvgam`](https://ecogambler.netlify.app/blog/distributed-lags-mgcv/){target="_blank"}
* [State-Space Vector Autoregressions in `mvgam`](https://ecogambler.netlify.app/blog/vector-autoregressions/){target="_blank"}
Expand Down
83 changes: 44 additions & 39 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -43,9 +43,14 @@ target="_blank">Sarah Heaps</a>, among many others.

A series of <a href="https://nicholasjclark.github.io/mvgam/"
target="_blank">vignettes cover data formatting, forecasting and several
extended case studies of DGAMs</a>. A number of other examples have also
been compiled:

extended case studies of DGAMs</a>. A number of other examples,
including some step-by-step introductory webinars, have also been
compiled:

- <a
href="https://www.youtube.com/playlist?list=PLzFHNoUxkCvsFIg6zqogylUfPpaxau_a3"
target="_blank">Time series in R and Stan using the <code>mvgam</code>
package</a>
- <a href="https://www.youtube.com/watch?v=0zZopLlomsQ"
target="_blank">Ecological Forecasting with Dynamic Generalized Additive
Models</a>
Expand Down Expand Up @@ -246,38 +251,37 @@ summary(lynx_mvgam)
#>
#> GAM coefficient (beta) estimates:
#> 2.5% 50% 97.5% Rhat n_eff
#> (Intercept) 6.400 6.60 6.900 1.01 926
#> s(season).1 -0.630 -0.13 0.340 1.00 1365
#> s(season).2 0.730 1.30 1.900 1.00 1060
#> s(season).3 1.200 1.90 2.600 1.00 993
#> s(season).4 -0.087 0.55 1.200 1.00 975
#> s(season).5 -1.300 -0.70 -0.074 1.00 968
#> s(season).6 -1.300 -0.56 0.120 1.00 1252
#> s(season).7 0.032 0.73 1.400 1.00 1259
#> s(season).8 0.610 1.40 2.100 1.00 729
#> s(season).9 -0.370 0.23 0.890 1.00 829
#> s(season).10 -1.400 -0.86 -0.360 1.00 1233
#> (Intercept) 6.400 6.60 6.900 1.00 837
#> s(season).1 -0.620 -0.14 0.390 1.01 729
#> s(season).2 0.740 1.30 1.900 1.00 902
#> s(season).3 1.300 1.90 2.600 1.00 734
#> s(season).4 -0.046 0.53 1.100 1.00 945
#> s(season).5 -1.300 -0.70 -0.053 1.00 730
#> s(season).6 -1.200 -0.57 0.160 1.00 876
#> s(season).7 0.051 0.73 1.400 1.00 917
#> s(season).8 0.610 1.40 2.100 1.00 753
#> s(season).9 -0.380 0.22 0.840 1.00 717
#> s(season).10 -1.400 -0.88 -0.390 1.00 985
#>
#> Approximate significance of GAM smooths:
#> edf Ref.df Chi.sq p-value
#> s(season) 9.97 10 48.6 <2e-16 ***
#> s(season) 9.98 10 49.1 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Latent trend parameter AR estimates:
#> 2.5% 50% 97.5% Rhat n_eff
#> ar1[1] 0.60 0.83 0.98 1.01 671
#> sigma[1] 0.38 0.47 0.60 1.00 750
#> ar1[1] 0.60 0.83 0.98 1 625
#> sigma[1] 0.38 0.48 0.60 1 787
#>
#> Stan MCMC diagnostics:
#> n_eff / iter looks reasonable for all parameters
#> Rhat looks reasonable for all parameters
#> 0 of 2000 iterations ended with a divergence (0%)
#> 3 of 2000 iterations saturated the maximum tree depth of 10 (0.15%)
#> *Run with max_treedepth set to a larger value to avoid saturation
#> 0 of 2000 iterations saturated the maximum tree depth of 10 (0%)
#> E-FMI indicated no pathological behavior
#>
#> Samples were drawn using NUTS(diag_e) at Tue Dec 03 9:38:07 AM 2024.
#> Samples were drawn using NUTS(diag_e) at Mon Dec 16 10:06:22 AM 2024.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split MCMC chains
#> (at convergence, Rhat = 1)
Expand Down Expand Up @@ -420,8 +424,8 @@ plot(lynx_mvgam, type = 'forecast', newdata = lynx_test)

<img src="man/figures/README-unnamed-chunk-21-1.png" alt="Plotting forecast distributions using mvgam in R" width="60%" style="display: block; margin: auto;" />

#> Out of sample CRPS:
#> 2453.7903515
#> Out of sample DRPS:
#> 2412.582034

And the estimated latent trend component, again using the more flexible
`plot_mvgam_...()` option to show first derivatives of the estimated
Expand Down Expand Up @@ -475,31 +479,32 @@ description
```

#> Methods text skeleton
#> We used the R package mvgam (version 1.1.4; Clark & Wells, 2023) to construct, fit and interrogate the mo
#> del. mvgam fits Bayesian State-Space models that can include flexible predictor effects in both the proce
#> ss and observation components by incorporating functionalities from the brms (Bürkner 2017), mgcv (Wood 2
#> 017) and splines2 (Wang & Yan, 2023) packages. The mvgam-constructed model and observed data were passed
#> to the probabilistic programming environment Stan (version 2.34.1; Carpenter et al. 2017, Stan Developmen
#> t Team 2024), specifically through the cmdstanr interface (Gabry & Češnovar, 2021). We ran 4 Hamiltonian
#> Monte Carlo chains for 500 warmup iterations and 500 sampling iterations for joint posterior estimation.
#> Rank normalized split Rhat (Vehtari et al. 2021) and effective sample sizes were used to monitor converge
#> nce.
#> We used the R package mvgam (version 1.1.4; Clark & Wells, 2023) to construct, fit and int
#> errogate the model. mvgam fits Bayesian State-Space models that can include flexible predi
#> ctor effects in both the process and observation components by incorporating functionaliti
#> es from the brms (Burkner 2017), mgcv (Wood 2017) and splines2 (Wang & Yan, 2023) packages
#> . The mvgam-constructed model and observed data were passed to the probabilistic programmi
#> ng environment Stan (version 2.34.1; Carpenter et al. 2017, Stan Development Team 2024), s
#> pecifically through the cmdstanr interface (Gabry & Cesnovar, 2021). We ran 4 Hamiltonian
#> Monte Carlo chains for 500 warmup iterations and 500 sampling iterations for joint posteri
#> or estimation. Rank normalized split Rhat (Vehtari et al. 2021) and effective sample sizes
#> were used to monitor convergence.

#>
#> Primary references
#> Clark, NJ and Wells K (2022). Dynamic Generalized Additive Models (DGAMs) for forecasting discrete ecological time series. Methods in Ecology and Evolution, 14, 771-784. doi.org/10.1111/2041-210X.13974
#> Bürkner, PC (2017). brms: An R Package for Bayesian Multilevel Models Using Stan. Journal of Statistical Software, 80(1), 1-28. doi:10.18637/jss.v080.i01
#> Burkner, PC (2017). brms: An R Package for Bayesian Multilevel Models Using Stan. Journal of Statistical Software, 80(1), 1-28. doi:10.18637/jss.v080.i01
#> Wood, SN (2017). Generalized Additive Models: An Introduction with R (2nd edition). Chapman and Hall/CRC.
#> Wang W and Yan J (2021). Shape-Restricted Regression Splines with R Package splines2. Journal of Data Science, 19(3), 498-517. doi:10.6339/21-JDS1020 <https://doi.org/10.6339/21-JDS1020>.
#> Wang W and Yan J (2021). Shape-Restricted Regression Splines with R Package splines2. Journal of Data Science, 19(3), 498-517. doi:10.6339/21-JDS1020 https://doi.org/10.6339/21-JDS1020.
#> Carpenter, B, Gelman, A, Hoffman, MD, Lee, D, Goodrich, B, Betancourt, M, Brubaker, M, Guo, J, Li, P and Riddell, A (2017). Stan: A probabilistic programming language. Journal of Statistical Software 76.
#> Gabry J, Češnovar R, Johnson A, and Bronder S (2024). cmdstanr: R Interface to 'CmdStan'. https://mc-stan.org/cmdstanr/, https://discourse.mc-stan.org.
#> Vehtari A, Gelman A, Simpson D, Carpenter B, and Bürkner P (2021). Rank-normalization, folding, and localization: An improved Rhat for assessing convergence of MCMC (with discussion). Bayesian Analysis 16(2) 667-718. https://doi.org/10.1214/20-BA1221.
#> Gabry J, Cesnovar R, Johnson A, and Bronder S (2024). cmdstanr: R Interface to 'CmdStan'. https://mc-stan.org/cmdstanr/, https://discourse.mc-stan.org.
#> Vehtari A, Gelman A, Simpson D, Carpenter B, and Burkner P (2021). Rank-normalization, folding, and localization: An improved Rhat for assessing convergence of MCMC (with discussion). Bayesian Analysis 16(2) 667-718. https://doi.org/10.1214/20-BA1221.
#>
#> Other useful references
#> Arel-Bundock V (2024). marginaleffects: Predictions, Comparisons, Slopes, Marginal Means, and Hypothesis Tests. R package version 0.19.0.4, https://marginaleffects.com/.
#> Gabry J, Simpson D, Vehtari A, Betancourt M, and Gelman A (2019). Visualization in Bayesian workflow. Journal of the Royal Statatistical Society A, 182, 389-402. doi:10.1111/rssa.12378.
#> Arel-Bundock, V, Greifer, N, and Heiss, A (2024). How to interpret statistical models using marginaleffects for R and Python. Journal of Statistical Software, 111(9), 1-32. https://doi.org/10.18637/jss.v111.i09
#> Gabry J, Simpson D, Vehtari A, Betancourt M, and Gelman A (2019). Visualization in Bayesian workflow. Journal of the Royal Statatistical Society A, 182, 389-402. doi:10.1111/rssa.12378.
#> Vehtari A, Gelman A, and Gabry J (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413-1432. doi:10.1007/s11222-016-9696-4.
#> Bürkner, PC, Gabry, J, and Vehtari, A. (2020). Approximate leave-future-out cross-validation for Bayesian time series models. Journal of Statistical Computation and Simulation, 90(14), 24992523. https://doi.org/10.1080/00949655.2020.1783262
#> Burkner, PC, Gabry, J, and Vehtari, A. (2020). Approximate leave-future-out cross-validation for Bayesian time series models. Journal of Statistical Computation and Simulation, 90(14), 2499-2523. https://doi.org/10.1080/00949655.2020.1783262

## Extended observation families

Expand Down Expand Up @@ -618,7 +623,7 @@ summary(mod, include_betas = FALSE)
#> 0 of 2000 iterations saturated the maximum tree depth of 10 (0%)
#> E-FMI indicated no pathological behavior
#>
#> Samples were drawn using NUTS(diag_e) at Tue Dec 03 9:39:28 AM 2024.
#> Samples were drawn using NUTS(diag_e) at Mon Dec 16 10:07:43 AM 2024.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split MCMC chains
#> (at convergence, Rhat = 1)
Expand Down
Loading

0 comments on commit a23a51a

Please sign in to comment.