Merge pull request #9 from nfidd/viz-session
visualisation session
sbfnk authored Nov 1, 2024
2 parents 66e7b67 + ac2e241 commit fe2ac9b
Showing 3 changed files with 288 additions and 33 deletions.
269 changes: 269 additions & 0 deletions sessions/forecast-visualisation.qmd
@@ -0,0 +1,269 @@
---
title: "Visualising infectious disease forecasts"
order: 1
bibliography: ueifid.bib
---

# Introduction

Epidemiological forecasts are probabilistic statements about the future course of population disease burden.
In this session we will look at simple forecasts made with a commonly used infectious disease model based on the renewal equation.
We will see how we can visualise such forecasts, and visually compare them to what really happened.

## Slides

- [Forecasting as an epidemiological problem](slides/forecasting-as-an-epidemiological-problem)

## Objectives

The aim of this session is to introduce the concept of forecasting and forecast visualisation using a simple model.

::: {.callout-note collapse="true"}

# Setup

## Source file

The source file of this session is located at `sessions/forecast-visualisation.qmd`.

## Libraries used

In this session we will use the `nfidd` package to load a data set of infection times and corresponding forecasts, the `dplyr` package for data wrangling, and the `ggplot2` library for plotting.

```{r libraries, message = FALSE}
library("nfidd")
library("dplyr")
library("ggplot2")
```

::: {.callout-tip}
The best way to interact with the material is via the [Visual Editor](https://docs.posit.co/ide/user/ide/guide/documents/visual-editor.html) of RStudio.
:::

## Initialisation

We set a random seed for reproducibility.
Setting this ensures that you should get exactly the same results on your computer as we do.

```{r}
set.seed(123)
```

:::

# What is forecasting?

Forecasting is the process of making predictions about the future based on past and present data.
In the context of infectious disease epidemiology, forecasting is usually the process of predicting the future course of some metric of infectious disease incidence or prevalence based on past and present data.
Here we focus on forecasting observed data (the number of individuals with new symptom onset) but forecasts can also be made for other quantities of interest such as the number of infections, the reproduction number, or the number of deaths.
Epidemiological forecasting is closely related to nowcasting and, when using mechanistic approaches, estimation of the reproduction number. In fact, the model we will use for forecasting is the same as the model we used for nowcasting and estimation of the reproduction number.
The only difference is that we will extend the model into the future.
In order to make things simpler we will remove the nowcasting part of the model, but in principle all these approaches could be combined in a single model.

# Extending a model into the future

We will use a data set of infections provided in the `nfidd` R package, and look at forecasts made using a renewal equation model.
This model simply assumes that the number of infectious people determines the number of future infectious people via the reproduction number $R_t$.

The reproduction number $R_t$ (sometimes called the _effective_ reproduction number) more generally describes the average number of secondary infections caused by a single infectious individual and can vary in time and space as a function of differences in population level susceptibility, changes in behaviour, policy, seasonality etc. The reproduction number $R_t$ is therefore a more general concept than the _basic_ reproduction number $R_0$ which represents the average number of secondary infections caused by a single infectious individual in a completely susceptible population.

::: {.callout-tip collapse="true"}
# The discrete renewal equation (optional)

The _renewal equation_ represents a general model of infectious disease transmission which includes the SIR model as a special case.
In its basic form it makes no assumption about the specific processes that cause $R_t$ to take a certain value or to change over time; instead, it only relates the number of infected people in the population, the current value of the reproduction number, and a delay distribution that represents the timing of when individuals infect others relative to when they themselves became infected (the so-called generation time).
Mathematically, it can be written as

$$
I_t = R_t \sum_{i=1}^{g_\mathrm{max}} I_{t-i} g_i
$$

Here, $I_t$ is the number of infected individuals on day $t$, $R_t$ is the current value of the reproduction number and $g_i$ is the probability of a secondary infection occurring $i$ days after the infector became infected themselves, with a maximum $g_\mathrm{max}$.
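
To make this concrete, here is a minimal sketch (for illustration only, not the model code used in the `nfidd` package) that iterates the renewal equation forward in time with a constant reproduction number and a made-up generation time distribution:

```{r renewal-sketch}
## illustrative only: iterate the discrete renewal equation forward
g <- c(0.25, 0.5, 0.25) ## made-up generation time pmf (g_1, g_2, g_3)
R <- 1.2                ## constant reproduction number for this sketch
I <- c(1, rep(NA, 20))  ## start from a single infection on day 1
for (t in 2:length(I)) {
  past <- (t - 1):max(1, t - length(g)) ## days t-1, t-2, ... that contribute infections
  I[t] <- R * sum(I[past] * g[seq_along(past)]) ## I_t = R_t * sum_i I_{t-i} g_i
}
round(I, 2)
```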

The forecasting model we visualise in this session is based on $R_t$ doing a random walk in time.
The step size of this random walk is estimated from data, and this is then used to project into the future.
For more information on this approach to forecasting see, for example, @funk2018real.
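
As a rough illustration of this idea (a sketch with assumed values, not the fitted model), a geometric random walk on $R_t$ with step size `sigma` can be projected forward like this:

```{r rw-sketch}
## illustrative only: project R_t forward as a geometric random walk
sigma <- 0.1     ## assumed step size (in the real model this is estimated from data)
R_current <- 1.1 ## assumed most recent value of R_t
R_future <- R_current * exp(cumsum(rnorm(14, mean = 0, sd = sigma)))
round(R_future, 2)
```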

:::

We will now look at some data from an infectious disease outbreak and forecasts from a renewal equation model.
First, we load the data from the `nfidd` package:

```{r, load-simulated-onset}
gen_time_pmf <- make_gen_time_pmf()
ip_pmf <- make_ip_pmf()
onset_df <- simulate_onsets(
make_daily_infections(infection_times), gen_time_pmf, ip_pmf
)
tail(onset_df)
```

This uses a data set of infection times that is included in the R package, and a function to turn these into daily numbers of observed symptom onsets.
Do not worry too much about the details of this.
The important thing is that we now have a data set `onset_df` of daily symptom onsets.
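
Before looking at any forecasts it can be useful to plot the observed onsets themselves (`onset_df` contains the columns `day` and `onsets` used throughout this session):

```{r plot-onsets}
ggplot(onset_df, aes(x = day, y = onsets)) +
  geom_point() +
  labs(title = "Daily symptom onsets", x = "Day", y = "Onsets")
```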

## Visualising the forecast

We can now visualise a forecast made with the renewal equation model described above.
Once again, this forecast is provided in the `nfidd` package which we loaded earlier.
We will first extract the forecast and then plot the forecasted number of symptom onsets alongside the observed number of symptom onsets before the forecast was made.

```{r extract-forecast}
data(rw_forecasts)
horizon <- 14
forecast_day <- 71
forecast <- rw_forecasts |>
ungroup() |>
filter(target_day == forecast_day)
previous_onsets <- onset_df |>
filter(day <= forecast_day)
```

```{r plot_forecast}
forecast |>
filter(.draw %in% sample(.draw, 100)) |>
ggplot(aes(x = day)) +
geom_line(alpha = 0.1, aes(y = .value, group = .draw)) +
geom_point(data = previous_onsets, aes(x = day, y = onsets), color = "black") +
labs(
title = "Symptom onsets",
subtitle = "Forecast (trajectories) and observed (points)"
)
```

In this plot, the dots show the data and the lines are forecast trajectories that the model deems plausible and consistent with the data so far.

::: {.callout-tip}
## Take 5 minutes
What do you think of this forecast?
Does it match your intuition of how this outbreak will continue?
Is there another way you could visualise the forecast that might be more informative?

:::

::: {.callout-note collapse="true"}
## Solution
- The forecast mainly predicts things to continue growing as they have been. However, some of the trajectories are going down, indicating that with some probability the outbreak might end.
- Based purely on the data, and without knowing anything else about the disease and setting, it is hard to come up with an alternative. The model seems sensible given the available data. In particular, uncertainty increases with increasing distance from the data, which is a sign of a good forecasting model.
- Instead of visualising plausible trajectories we could visualise a cone with a central forecast and uncertainty around it, as sketched below. We will look at this in more detail in the next session as an alternative.
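
As a rough sketch of what such a visualisation could look like (summarising the sampled trajectories into a median and a 90% interval; the next session may use a different approach):

```{r plot-forecast-ribbon}
forecast_quantiles <- forecast |>
  group_by(day) |>
  summarise(
    median = median(.value),
    lower = quantile(.value, 0.05),
    upper = quantile(.value, 0.95),
    .groups = "drop"
  )
ggplot(forecast_quantiles, aes(x = day)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3) +
  geom_line(aes(y = median)) +
  geom_point(data = previous_onsets, aes(x = day, y = onsets)) +
  labs(
    title = "Symptom onsets",
    subtitle = "Forecast (median and 90% interval) and observed (points)"
  )
```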
:::

If we want to know how well a model is doing at forecasting, we can later compare it to what really happened.
If we do this, we get the following plot.

```{r plot_future_forecast}
future_onsets <- onset_df |>
filter(day <= forecast_day + horizon)
forecast |>
filter(.draw %in% sample(.draw, 100)) |>
ggplot(aes(x = day)) +
geom_line(alpha = 0.1, aes(y = .value, group = .draw)) +
geom_point(data = future_onsets, aes(x = day, y = onsets), color = "black") +
labs(
title = "Symptom onsets",
subtitle = "Forecast (trajectories) and observed (points)"
)
```

::: {.callout-tip}
## Take 5 minutes
What do you think now of this forecast?
Did the model do a good job?
Again, is there another way you could visualise the forecast that might be more informative?
:::

::: {.callout-note collapse="true"}
## Solution
- On the face of it, the forecast looks very poor, with some very high predictions compared to the data.
- Based on this visualisation it is hard to tell if the model is doing a good job but it seems like it is not.
- As outbreaks are generally considered to be exponential processes it might be more informative to plot the forecast on the log scale.

```{r plot_forecast_log}
forecast |>
filter(.draw %in% sample(.draw, 100)) |>
ggplot(aes(x = day)) +
geom_line(alpha = 0.1, aes(y = .value, group = .draw)) +
geom_point(data = future_onsets, aes(x = day, y = onsets), color = "black") +
scale_y_log10() +
labs(title = "Symptom onsets, log scale", subtitle = "Forecast and observed")
```

This should be a lot more informative.
We see that for longer forecast horizons the model is not doing a great job of capturing the reduction in symptom onsets.
However, we can now see that the model seems to be producing very reasonable forecasts for the first week or so of the forecast.
This is a common pattern in forecasting where a model is good at capturing the short term dynamics but struggles with the longer term dynamics.
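
One way to see this pattern more directly (a sketch, defining the horizon as the number of days since the forecast date) is to compare the forecast median to the observations by horizon:

```{r forecast-by-horizon}
forecast |>
  group_by(day) |>
  summarise(median = median(.value), .groups = "drop") |>
  left_join(future_onsets, by = "day") |>
  mutate(horizon = day - forecast_day) |>
  filter(horizon > 0) |>
  ggplot(aes(x = horizon)) +
  geom_line(aes(y = median)) +
  geom_point(aes(y = onsets)) +
  labs(
    title = "Forecast median (line) and observed onsets (points) by horizon",
    x = "Days since forecast date", y = "Onsets"
  )
```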
:::

We managed to learn quite a lot about our model's forecasting limitations just by looking at a single forecast using visualisations. However, what if we wanted to quantify how well the model is doing?
In order to do that we need to look at multiple forecasts.

## Visualising multiple forecasts

As for a single forecast, our first step is to visualise the forecasts, as this can give us a good idea of how well the model is doing without having to calculate any metrics.

```{r plot-all-forecasts}
rw_forecasts |>
filter(.draw %in% sample(.draw, 100)) |>
ggplot(aes(x = day)) +
geom_line(
aes(y = .value, group = interaction(.draw, target_day), col = target_day),
alpha = 0.1
) +
geom_point(
data = onset_df |>
filter(day >= 21),
aes(x = day, y = onsets), color = "black") +
scale_color_binned(type = "viridis") +
labs(title = "Weekly forecasts of symptom onsets over an outbreak",
col = "Forecast start day")
```


As for the single forecast, it may be helpful to also plot the forecasts on the log scale.

```{r plot-all-forecasts-log}
rw_forecasts |>
filter(.draw %in% sample(.draw, 100)) |>
ggplot(aes(x = day)) +
geom_line(
aes(y = .value, group = interaction(.draw, target_day), col = target_day),
alpha = 0.1
) +
geom_point(data = onset_df, aes(x = day, y = onsets), color = "black") +
scale_y_log10() +
scale_color_binned(type = "viridis") +
labs(title = "Weekly symptom onset forecasts: log scale",
col = "Forecast start day")
```

::: {.callout-tip}
## Take 5 minutes
What do you think of these forecasts?
Are they any good?
How well do they capture changes in trend?
Does the uncertainty seem reasonable?
Do they seem to under- or over-predict consistently?
Would you visualise the forecast in a different way?
:::

::: {.callout-note collapse="true"}
## Solution
- We think these forecasts are a reasonable place to start, but there is definitely room for improvement.

Are they any good?

- They seem to do a reasonable job of capturing the short-term dynamics but struggle with the longer-term dynamics.

How well do they capture changes in trend?

- There is little evidence of the model capturing the reduction in onsets before it begins to show in the data.

Does the uncertainty seem reasonable?

- On the natural scale it looks like the model often over-predicts. Things seem more balanced on the log scale, but the model still seems to be overly uncertain.

Do they seem to under- or over-predict consistently?

- It looks like the model is consistently over-predicting on the natural scale, but this is less clear on the log scale. One way to check is to look at each forecast separately, as sketched below.
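
As a rough sketch (plotting each forecast in its own panel so that it can be compared directly with the data):

```{r plot-forecasts-facets}
rw_forecasts |>
  filter(.draw %in% sample(.draw, 100)) |>
  ggplot(aes(x = day)) +
  geom_line(aes(y = .value, group = .draw), alpha = 0.1) +
  geom_point(data = onset_df, aes(x = day, y = onsets), color = "black") +
  scale_y_log10() +
  facet_wrap(~ target_day) +
  labs(title = "Forecasts of symptom onsets by forecast start day (log scale)")
```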
:::

# Wrap up

# References

::: {#refs}
:::
37 changes: 4 additions & 33 deletions sessions/slides/forecasting-as-an-epidemiological-problem.qmd
@@ -1,6 +1,6 @@
---
title: "Forecasting as an epidemiological problem"
author: "Nowcasting and forecasting of infectious disease dynamics"
author: "Understanding, evaluating, and improving forecasts of infectious disease burden"
engine: knitr
format:
revealjs:
@@ -48,46 +48,17 @@ However:

[CDC: About Flu Forecasting](https://www.cdc.gov/flu/weekly/flusight/how-flu-forecasting.htm)

### Relationship with $R_t$ estimation {.smaller}

- In infectious disease epidemiology, many relevant interventions and other changes affect the strength of **transmission**
- Things that affect transmission don't affect the predicted outcomes directly, but via $R_t$
- In that sense, predicting infections comes down to **predicting $R_t$**
- Commonly, forecast models assume **no change** in $R_t$. Is this a good assumption?

### Relationship with nowcasting {.smaller}

- Nowcast: we have **some** data from the dates we're predicting
- Forecast: we have **no** data from the dates we're predicting (usually: because they're in the future)

### Forecasts are usually *probabilistic* {.smaller}

- Because the future is uncertain, it is natural to express predictions in probabilities, e.g. there is a X% chance of exceeding Y hospital admissions.

![](figures/weather_forecast.png)

### Importance of evaluation {.smaller}

- Because forecasts are unconditional (what will happen) we can compare them to data and see how well they did
- Doing this allows us to answer questions like
- Are our forecasts any good?
- How far ahead can we trust forecasts?
- Which model works best for making forecasts?
- So-called **proper scoring rules** incentivise forecasters to express an honest belief about the future
- Many proper scoring rules (and other metrics) are available to assess probabilistic forecasts

### The forecasting paradigm {.smaller}

#### Maximise *sharpness* subject to *calibration*

- Statements about the future should be **correct** ("calibration")
- Statements about the future should aim to have **narrow uncertainty** ("sharpness")

## `r fontawesome::fa("laptop-code", "white")` Your Turn {background-color="#447099" transition="fade-in"}

-1. Start with a model we used before and use it to make a forecast (using stan)
-2. Evaluate the forecasts using proper scoring rules
+1. Load some forecasts we have generated and visualise them alongside the data.
+2. Explore different ways of visualising forecasts to assess how well the model performs at forecasting.

#

-[Return to the session](../forecasting-concepts)
+[Return to the session](../forecast-visualisation)
15 changes: 15 additions & 0 deletions sessions/ueifid.bib
@@ -0,0 +1,15 @@
@Article{funk2018real,
author = {Funk, Sebastian and Camacho, Anton and Kucharski, Adam J.
and Eggo, Rosalind M. and Edmunds, W. John},
title = {Real-time forecasting of infectious disease dynamics with a
stochastic semi-mechanistic model},
journal = {Epidemics},
year = 2018,
volume = 22,
month = mar,
pages = {56–61},
issn = {1755-4365},
doi = {10.1016/j.epidem.2016.11.003},
url = {http://dx.doi.org/10.1016/j.epidem.2016.11.003},
publisher = {Elsevier BV}
}
