diff --git a/NEWS.md b/NEWS.md index 935c955b..2a6a8cd9 100644 --- a/NEWS.md +++ b/NEWS.md @@ -15,6 +15,7 @@ * Arguments `use_stan`, `jags_path`, `data_train`, `data_test`, `adapt_delta`, `max_treedepth` and `drift` have been removed from primary functions to streamline documentation and reflect the package's mission to deprecate 'JAGS' as a suitable backend. Both `adapt_delta` and `max_treedepth` should now be supplied in a named `list()` to the new argument `control` ## Bug fixes +* Updates to ensure `ensemble` provides appropriate weighting of forecast draws (#98) * Not necessarily a "bug fix", but this update removes several dependencies to lighten installation and improve efficiency of the workflow (#93) * Fixed a minor bug in the way `trend_map` recognises levels of the `series` factor * Bug fix to ensure `lfo_cv` recognises the actual times in `time`, just in case the user supplies data that doesn't start at `t = 1`. Also updated documentation to better reflect this diff --git a/R/ensemble.R b/R/ensemble.R index 84381e9b..3e1dd832 100644 --- a/R/ensemble.R +++ b/R/ensemble.R @@ -102,31 +102,56 @@ ensemble.mvgam_forecast <- function(object, ..., ndraws = 5000){ # End of checks; now proceed with ensembling n_series <- length(models[[1]]$series_names) - # Function to random sample rows of a matrix with - # replacement (in case some forecasts contain fewer draws than others) - subsamp <- function(x, nsamps){ - if(NROW(x) < nsamps){ - sampinds <- sample(1:NROW(x), nsamps, replace = TRUE) - } else { - sampinds <- sample(1:NROW(x), nsamps, replace = FALSE) - } - - x[sampinds, ] - } + # Calculate total number of forecast draws to sample from for each model + n_mod_draws <- lapply(seq_len(n_models), function(x){ + NROW(models[[x]]$forecasts[[1]]) + }) - # Create evenly weighted ensemble hindcasts and forecasts + # Calculate model weights (only option at the moment is even weighting, + # but this may be relaxed in future) + mod_weights <- data.frame(mod = paste0('mod', 1:n_models), + orig_weight = 1 / n_models, + ndraws = unlist(n_mod_draws, + use.names = FALSE)) %>% + # Adjust weights by the number of draws available per + # forecast, ensuring that models with fewer draws aren't + # under-represented in the final weighted ensemble + dplyr::mutate(weight = (orig_weight / ndraws) * 100) %>% + dplyr::mutate(mod = as.factor(mod)) %>% + dplyr::select(mod, weight) + + # Create draw indices + mod_inds <- as.factor(unlist(lapply(seq_len(n_models), function(x){ + rep(paste0('mod', x), NROW(models[[x]]$forecasts[[1]])) + }), use.names = FALSE)) + all_draw_inds <- 1:sum(unlist(n_mod_draws, + use.names = FALSE)) + mod_inds_draws <- split(all_draw_inds, mod_inds) + + # Add model-specific weights to the draw indices + draw_weights <- data.frame(draw = all_draw_inds, + mod = mod_inds) %>% + dplyr::left_join(mod_weights, by = 'mod') + + # Perform multinomial sampling using draw-specific weights + fc_draws <- sample(all_draw_inds, + size = ndraws, + replace = max(all_draw_inds) < ndraws, + prob = draw_weights$weight) + + # Create weighted ensemble hindcasts and forecasts ens_hcs <- lapply(seq_len(n_series), function(series){ all_hcs <- do.call(rbind, lapply(models, function(x) x$hindcasts[[series]])) - subsamp(all_hcs, ndraws) + all_hcs[fc_draws, ] }) ens_fcs <- lapply(seq_len(n_series), function(series){ all_fcs <- do.call(rbind, lapply(models, function(x) x$forecasts[[series]])) - subsamp(all_fcs, ndraws) + all_fcs[fc_draws, ] }) # Initiate the ensemble forecast object diff --git a/R/globals.R b/R/globals.R index 26aa793c..92f2a0d5 100644 --- a/R/globals.R +++ b/R/globals.R @@ -26,4 +26,4 @@ utils::globalVariables(c("y", "year", "smooth_vals", "smooth_num", "value", "threshold", "colour", "resids", "c_dark", "eval_timepoints", "yqlow", "ymidlow", "ymidhigh", "yqhigh", "preds", - "yhigh", "ylow")) + "yhigh", "ylow", "weight", "orig_weight")) diff --git a/README.Rmd b/README.Rmd index 940dff5a..b3007770 100644 --- a/README.Rmd +++ b/README.Rmd @@ -34,8 +34,9 @@ knitr::opts_chunk$set( The goal of `mvgam` is to fit Bayesian (Dynamic) Generalized Additive Models. This package constructs State-Space models that can include highly flexible nonlinear predictor effects for both process and observation components by leveraging functionalities from the impressive [`brms`](https://paulbuerkner.com/brms/){target="_blank"} and [`mgcv`](https://cran.r-project.org/web/packages/mgcv/index.html){target="_blank"} packages. This allows `mvgam` to fit a wide range of models, including hierarchical ecological models such as N-mixture or Joint Species Distribution models, as well as univariate and multivariate time series models with imperfect detection. The original motivation for the package is described in [Clark & Wells 2022](https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13974){target="_blank"} (published in *Methods in Ecology and Evolution*), with additional inspiration on the use of Bayesian probabilistic modelling coming from [Michael Betancourt](https://betanalpha.github.io/writing/){target="_blank"}, [Michael Dietze](https://www.bu.edu/earth/profiles/michael-dietze/){target="_blank"} and [Sarah Heaps](https://www.durham.ac.uk/staff/sarah-e-heaps/){target="_blank"}, among many others. ## Resources -A series of [vignettes cover data formatting, forecasting and several extended case studies of DGAMs](https://nicholasjclark.github.io/mvgam/){target="_blank"}. A number of other examples have also been compiled: +A series of [vignettes cover data formatting, forecasting and several extended case studies of DGAMs](https://nicholasjclark.github.io/mvgam/){target="_blank"}. A number of other examples, including some step-by-step introductory webinars, have also been compiled: +* [Time series in R and Stan using the `mvgam` package](https://www.youtube.com/playlist?list=PLzFHNoUxkCvsFIg6zqogylUfPpaxau_a3){target="_blank"} * [Ecological Forecasting with Dynamic Generalized Additive Models](https://www.youtube.com/watch?v=0zZopLlomsQ){target="_blank"} * [Distributed lags (and hierarchical distributed lags) using `mgcv` and `mvgam`](https://ecogambler.netlify.app/blog/distributed-lags-mgcv/){target="_blank"} * [State-Space Vector Autoregressions in `mvgam`](https://ecogambler.netlify.app/blog/vector-autoregressions/){target="_blank"} diff --git a/README.md b/README.md index 0fc4ffc2..65fdc807 100644 --- a/README.md +++ b/README.md @@ -43,9 +43,14 @@ target="_blank">Sarah Heaps, among many others. A series of vignettes cover data formatting, forecasting and several -extended case studies of DGAMs. A number of other examples have also -been compiled: - +extended case studies of DGAMs. A number of other examples, +including some step-by-step introductory webinars, have also been +compiled: + +- Time series in R and Stan using the mvgam + package - Ecological Forecasting with Dynamic Generalized Additive Models @@ -246,38 +251,37 @@ summary(lynx_mvgam) #> #> GAM coefficient (beta) estimates: #> 2.5% 50% 97.5% Rhat n_eff -#> (Intercept) 6.400 6.60 6.900 1.01 926 -#> s(season).1 -0.630 -0.13 0.340 1.00 1365 -#> s(season).2 0.730 1.30 1.900 1.00 1060 -#> s(season).3 1.200 1.90 2.600 1.00 993 -#> s(season).4 -0.087 0.55 1.200 1.00 975 -#> s(season).5 -1.300 -0.70 -0.074 1.00 968 -#> s(season).6 -1.300 -0.56 0.120 1.00 1252 -#> s(season).7 0.032 0.73 1.400 1.00 1259 -#> s(season).8 0.610 1.40 2.100 1.00 729 -#> s(season).9 -0.370 0.23 0.890 1.00 829 -#> s(season).10 -1.400 -0.86 -0.360 1.00 1233 +#> (Intercept) 6.400 6.60 6.900 1.00 837 +#> s(season).1 -0.620 -0.14 0.390 1.01 729 +#> s(season).2 0.740 1.30 1.900 1.00 902 +#> s(season).3 1.300 1.90 2.600 1.00 734 +#> s(season).4 -0.046 0.53 1.100 1.00 945 +#> s(season).5 -1.300 -0.70 -0.053 1.00 730 +#> s(season).6 -1.200 -0.57 0.160 1.00 876 +#> s(season).7 0.051 0.73 1.400 1.00 917 +#> s(season).8 0.610 1.40 2.100 1.00 753 +#> s(season).9 -0.380 0.22 0.840 1.00 717 +#> s(season).10 -1.400 -0.88 -0.390 1.00 985 #> #> Approximate significance of GAM smooths: #> edf Ref.df Chi.sq p-value -#> s(season) 9.97 10 48.6 <2e-16 *** +#> s(season) 9.98 10 49.1 <2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Latent trend parameter AR estimates: #> 2.5% 50% 97.5% Rhat n_eff -#> ar1[1] 0.60 0.83 0.98 1.01 671 -#> sigma[1] 0.38 0.47 0.60 1.00 750 +#> ar1[1] 0.60 0.83 0.98 1 625 +#> sigma[1] 0.38 0.48 0.60 1 787 #> #> Stan MCMC diagnostics: #> n_eff / iter looks reasonable for all parameters #> Rhat looks reasonable for all parameters #> 0 of 2000 iterations ended with a divergence (0%) -#> 3 of 2000 iterations saturated the maximum tree depth of 10 (0.15%) -#> *Run with max_treedepth set to a larger value to avoid saturation +#> 0 of 2000 iterations saturated the maximum tree depth of 10 (0%) #> E-FMI indicated no pathological behavior #> -#> Samples were drawn using NUTS(diag_e) at Tue Dec 03 9:38:07 AM 2024. +#> Samples were drawn using NUTS(diag_e) at Mon Dec 16 10:06:22 AM 2024. #> For each parameter, n_eff is a crude measure of effective sample size, #> and Rhat is the potential scale reduction factor on split MCMC chains #> (at convergence, Rhat = 1) @@ -420,8 +424,8 @@ plot(lynx_mvgam, type = 'forecast', newdata = lynx_test) Plotting forecast distributions using mvgam in R - #> Out of sample CRPS: - #> 2453.7903515 + #> Out of sample DRPS: + #> 2412.582034 And the estimated latent trend component, again using the more flexible `plot_mvgam_...()` option to show first derivatives of the estimated @@ -475,31 +479,32 @@ description ``` #> Methods text skeleton - #> We used the R package mvgam (version 1.1.4; Clark & Wells, 2023) to construct, fit and interrogate the mo - #> del. mvgam fits Bayesian State-Space models that can include flexible predictor effects in both the proce - #> ss and observation components by incorporating functionalities from the brms (Bürkner 2017), mgcv (Wood 2 - #> 017) and splines2 (Wang & Yan, 2023) packages. The mvgam-constructed model and observed data were passed - #> to the probabilistic programming environment Stan (version 2.34.1; Carpenter et al. 2017, Stan Developmen - #> t Team 2024), specifically through the cmdstanr interface (Gabry & Češnovar, 2021). We ran 4 Hamiltonian - #> Monte Carlo chains for 500 warmup iterations and 500 sampling iterations for joint posterior estimation. - #> Rank normalized split Rhat (Vehtari et al. 2021) and effective sample sizes were used to monitor converge - #> nce. + #> We used the R package mvgam (version 1.1.4; Clark & Wells, 2023) to construct, fit and int + #> errogate the model. mvgam fits Bayesian State-Space models that can include flexible predi + #> ctor effects in both the process and observation components by incorporating functionaliti + #> es from the brms (Burkner 2017), mgcv (Wood 2017) and splines2 (Wang & Yan, 2023) packages + #> . The mvgam-constructed model and observed data were passed to the probabilistic programmi + #> ng environment Stan (version 2.34.1; Carpenter et al. 2017, Stan Development Team 2024), s + #> pecifically through the cmdstanr interface (Gabry & Cesnovar, 2021). We ran 4 Hamiltonian + #> Monte Carlo chains for 500 warmup iterations and 500 sampling iterations for joint posteri + #> or estimation. Rank normalized split Rhat (Vehtari et al. 2021) and effective sample sizes + #> were used to monitor convergence. #> #> Primary references #> Clark, NJ and Wells K (2022). Dynamic Generalized Additive Models (DGAMs) for forecasting discrete ecological time series. Methods in Ecology and Evolution, 14, 771-784. doi.org/10.1111/2041-210X.13974 - #> Bürkner, PC (2017). brms: An R Package for Bayesian Multilevel Models Using Stan. Journal of Statistical Software, 80(1), 1-28. doi:10.18637/jss.v080.i01 + #> Burkner, PC (2017). brms: An R Package for Bayesian Multilevel Models Using Stan. Journal of Statistical Software, 80(1), 1-28. doi:10.18637/jss.v080.i01 #> Wood, SN (2017). Generalized Additive Models: An Introduction with R (2nd edition). Chapman and Hall/CRC. - #> Wang W and Yan J (2021). Shape-Restricted Regression Splines with R Package splines2. Journal of Data Science, 19(3), 498-517. doi:10.6339/21-JDS1020 . + #> Wang W and Yan J (2021). Shape-Restricted Regression Splines with R Package splines2. Journal of Data Science, 19(3), 498-517. doi:10.6339/21-JDS1020 https://doi.org/10.6339/21-JDS1020. #> Carpenter, B, Gelman, A, Hoffman, MD, Lee, D, Goodrich, B, Betancourt, M, Brubaker, M, Guo, J, Li, P and Riddell, A (2017). Stan: A probabilistic programming language. Journal of Statistical Software 76. - #> Gabry J, Češnovar R, Johnson A, and Bronder S (2024). cmdstanr: R Interface to 'CmdStan'. https://mc-stan.org/cmdstanr/, https://discourse.mc-stan.org. - #> Vehtari A, Gelman A, Simpson D, Carpenter B, and Bürkner P (2021). “Rank-normalization, folding, and localization: An improved Rhat for assessing convergence of MCMC (with discussion).” Bayesian Analysis 16(2) 667-718. https://doi.org/10.1214/20-BA1221. + #> Gabry J, Cesnovar R, Johnson A, and Bronder S (2024). cmdstanr: R Interface to 'CmdStan'. https://mc-stan.org/cmdstanr/, https://discourse.mc-stan.org. + #> Vehtari A, Gelman A, Simpson D, Carpenter B, and Burkner P (2021). Rank-normalization, folding, and localization: An improved Rhat for assessing convergence of MCMC (with discussion). Bayesian Analysis 16(2) 667-718. https://doi.org/10.1214/20-BA1221. #> #> Other useful references - #> Arel-Bundock V (2024). marginaleffects: Predictions, Comparisons, Slopes, Marginal Means, and Hypothesis Tests. R package version 0.19.0.4, https://marginaleffects.com/. - #> Gabry J, Simpson D, Vehtari A, Betancourt M, and Gelman A (2019). “Visualization in Bayesian workflow.” Journal of the Royal Statatistical Society A, 182, 389-402. doi:10.1111/rssa.12378. + #> Arel-Bundock, V, Greifer, N, and Heiss, A (2024). How to interpret statistical models using marginaleffects for R and Python. Journal of Statistical Software, 111(9), 1-32. https://doi.org/10.18637/jss.v111.i09 + #> Gabry J, Simpson D, Vehtari A, Betancourt M, and Gelman A (2019). Visualization in Bayesian workflow. Journal of the Royal Statatistical Society A, 182, 389-402. doi:10.1111/rssa.12378. #> Vehtari A, Gelman A, and Gabry J (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413-1432. doi:10.1007/s11222-016-9696-4. - #> Bürkner, PC, Gabry, J, and Vehtari, A. (2020). Approximate leave-future-out cross-validation for Bayesian time series models. Journal of Statistical Computation and Simulation, 90(14), 2499–2523. https://doi.org/10.1080/00949655.2020.1783262 + #> Burkner, PC, Gabry, J, and Vehtari, A. (2020). Approximate leave-future-out cross-validation for Bayesian time series models. Journal of Statistical Computation and Simulation, 90(14), 2499-2523. https://doi.org/10.1080/00949655.2020.1783262 ## Extended observation families @@ -618,7 +623,7 @@ summary(mod, include_betas = FALSE) #> 0 of 2000 iterations saturated the maximum tree depth of 10 (0%) #> E-FMI indicated no pathological behavior #> -#> Samples were drawn using NUTS(diag_e) at Tue Dec 03 9:39:28 AM 2024. +#> Samples were drawn using NUTS(diag_e) at Mon Dec 16 10:07:43 AM 2024. #> For each parameter, n_eff is a crude measure of effective sample size, #> and Rhat is the potential scale reduction factor on split MCMC chains #> (at convergence, Rhat = 1) diff --git a/docs/articles/data_in_mvgam.html b/docs/articles/data_in_mvgam.html index 9a7d9d95..591b613e 100644 --- a/docs/articles/data_in_mvgam.html +++ b/docs/articles/data_in_mvgam.html @@ -13,6 +13,7 @@ +