-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathbayesian-methods-intro.qmd
655 lines (483 loc) · 32.3 KB
/
bayesian-methods-intro.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
# Parameter calibration using Bayesian methods {#sec-bayes}
Including parameter uncertainty in forecasts, as demonstrated in @sec-prop-unc-pars, requires having a probability distribution for the parameters. The previous chapter shows how to estimate parameters using likelihood methods. While likelihood methods are designed to find optimal values, they are not designed to estimate the probability distribution of the parameters. By finding optimal values likelihood methods can reduce process uncertainty because a better calibrated model has a better match to historical data and smaller residual error. This chapter introduces the use of Bayesian methods to estimate parameter distributions.
## Introduction to Bayesian statistics
```{r}
#| echo: FALSE
#| message: FALSE
library(tidyverse)
library(patchwork)
set.seed(120)
```
### Starting with likelihood
Here we start where we left off with the likelihood chapter @sec-likelihood. Imagine the following dataset with five data points drawn from a normal distribution (@fig-rnorm1).
```{r}
num_data_points <- 5
mean_data <- 3.0
sd_data <- 1.0
new_data <- rnorm(num_data_points, mean = mean_data, sd = sd_data)
```
```{r}
#| fig-cap: Five random samples from a normal distribution with mean = 3 and sd = 1
#| label: fig-rnorm1
hist(new_data, main = "")
```
We can calculate the likelihood for a set of different means using the manual likelihood estimation that we did in @sec-likelihood.
```{r}
#DATA MODEL
delta_x <- 0.1
x <- seq(-10,10, delta_x)
negative_log_likelihood <- rep(NA,length(x))
for(i in 1:length(x)){
negative_log_likelihood[i] <- -sum(dnorm(new_data, mean = x[i], sd = 1, log = TRUE))
}
```
@fig-ll-bayes is the likelihood surface for different values of the mean given the observation of x = 2.
```{r}
#| fig-cap: Negative log-likelihood curve
#| label: fig-ll-bayes
plot(x, negative_log_likelihood)
```
The negative log-likelihood is useful for finding the maximum likelihood values with an optimizing function. However, here we want to convert back to a density (@fig-ll2-density)
```{r}
density_likelihood <- exp(-negative_log_likelihood)
```
```{r}
#| fig-cap: likelihood curve
#| label: fig-ll2-density
plot(x, density_likelihood, type = "l")
```
The y-axis is a tiny number because we multiplied probability densities together. This is OK for our illustrative purposes. To help visualize we can rescale so that the area under the curve is 1. The area is approximated where the density is the height and the width is the distance between points on the x-axis (`delta_x`) that we evaluated (height x width is the area of a bar under the curve). If we sum the bars together, we get the area under the curve (a value way less than 1). Dividing the likelihood by the area rescales the densities so the area under the curve is 1. Rescaling the likelihood is not a formal part of the analysis - just used here for visualizing.
```{r}
likelihood_area_under_curve <- sum(density_likelihood * delta_x) #Width * Height
density_likelihood_rescaled <- density_likelihood/likelihood_area_under_curve
```
The rescaled likelihood looks like this (@fig-rescaled-ll)
```{r}
#| fig-cap: Rescale likelihood (area under the curve is 1)
#| label: fig-rescaled-ll
d <- tibble(x = x,
likelihood = density_likelihood_rescaled)
ggplot(data = d, aes(x = x, y = likelihood)) +
geom_line() +
theme_bw()
```
Remember that this curve is the $P(data \| mean = x)$ - the probability of the data given the model. We want the $P(mean = x \| data)$ - probability of the model given the data.
Following Bayes rule where the $P(data | model)$ is the likelihood and $P(model)$ is our guess for the mean before seeing the data (called the prior)
$$
P(model | data) = \frac{P(data | model) \cdot P(model)}{P(data)}
$$
we can multiply the likelihood x the prior.
$$
P(mean | data) \simeq likelihood \cdot prior
$$ Our prior is the following normal distribution
```{r}
mean_prior <- 0
sd_prior <- 1.0
```
With the density for each value of x in @fig-prior-dens
```{r}
density_prior <- dnorm(x, mean = mean_prior, sd = sd_prior)
```
```{r}
#| fig-cap: likelihood of the x given the prior distribution
#| label: fig-prior-dens
d <- tibble(x = x,
density_prior = density_prior)
ggplot(d, aes(x = x, y = density_prior)) +
geom_line() +
theme_bw()
```
Putting the rescaled likelihood on the same plot as the prior results in @fig-prior-ll-dens
```{r}
#| fig-cap: likelihood of the x given the prior distribution and likelihood of the model.
#| label: fig-prior-ll-dens
tibble(x = x,
prior = density_prior,
likelihood_rescaled = density_likelihood_rescaled) %>%
pivot_longer(cols = -x, names_to = "distribution", values_to = "density") %>%
mutate(distribution = factor(distribution)) %>%
ggplot(aes(x = x, y = density, color = distribution)) +
geom_line() +
theme_bw()
```
Multiplying the prior x the likelihood (not rescaled) gives @fig-post-unscaled
```{r}
prior_times_likelihood <- density_prior * density_likelihood
```
```{r}
#| label: fig-post-unscaled
#| fig-cap: likelihood of the x given the prior distribution and likelihood of the model.
d <- tibble(x = x,
prior_times_likelihood = prior_times_likelihood)
ggplot(d, aes(x = x, y = prior_times_likelihood)) +
geom_line() +
theme_bw()
```
But from Bayes rule, we can rescale the prior \* likelihood by the area under the curve to convert to a probability density function.
```{r}
area_under_curve <- sum(prior_times_likelihood * delta_x) #sum(Width * Height) = total probability of the data given all possible values of the parameter
normalized_posterior <- prior_times_likelihood / area_under_curve
```
The probability of the data is (i.e., the area under likelihood \* prior curve): `r area_under_curve`
Now we can visualize the rescaled likelihood (remember that the un-scaled likelihood was used in the calculations), the prior, and the normalized posterior.
@fig-prior-ll-post shows how the posterior is a blend of the prior and the likelihood.
```{r}
#| fig-cap: Prior, likelihood, and posterior shown together
#| label: fig-prior-ll-post
tibble(x = x,
prior = density_prior,
likelihood = density_likelihood_rescaled,
normalized_posterior = normalized_posterior) %>%
pivot_longer(cols = -x, names_to = "distribution", values_to = "density") %>%
mutate(distribution = factor(distribution)) %>%
ggplot(aes(x = x, y = density, color = distribution)) +
geom_line() +
theme_bw()
```
Here is a function that will allow us to explore how the posterior is sensitive to the likelihood and the prior.
```{r}
explore_senstivity <- function(num_data_points, mean_data, sd_data, mean_prior, sd_prior, title){
new_data <- rnorm(num_data_points, mean_data, sd_data)
delta_x <- 0.1
x <- seq(-15,15, delta_x)
negative_log_likelihood <- rep(NA,length(x))
#DATA MODEL
for(i in 1:length(x)){
#Process model is that mean = x
negative_log_likelihood[i] <- -sum(dnorm(new_data, mean = x[i], sd = 1, log = TRUE))
}
density_likelihood <- exp(-negative_log_likelihood)
likelihood_area_under_curve <- sum(density_likelihood * delta_x) #Width * Height
density_likelihood_rescaled <- density_likelihood/likelihood_area_under_curve
#Prior
density_prior <- dnorm(x, mean = mean_prior, sd = sd_prior)
#Prior x Likelihood
prior_times_likelihood <- density_prior * density_likelihood
area_under_curve <- sum(prior_times_likelihood * delta_x) #Width * Height
normalized_posterior <- prior_times_likelihood / area_under_curve
p <- tibble(x = x,
prior = density_prior,
likelihood = density_likelihood_rescaled,
normalized_posterior = normalized_posterior) %>%
pivot_longer(cols = -x, names_to = "distribution", values_to = "density") %>%
mutate(distribution = factor(distribution)) %>%
ggplot(aes(x = x, y = density, color = distribution)) +
geom_line() +
labs(title = title) +
theme_bw()
return(p)
}
```
Now we can explore how the posterior is sensitive to 1) prior sd (i.e., confidence in the prior) 2) the number of data points used in the likelihood (how does increasing the number of data points influence the posterior) the prior mean 3) the mean of the data used in the likelihood (how different is it than the prior?)
```{r}
#Baseline
p1 <- explore_senstivity(num_data_points = 5,
mean_data = 3,
sd_data = 2,
mean_prior = 0,
sd_prior = 1.0,
title = "Baseline\n")
#Increase confidence in prior
p2 <- explore_senstivity(num_data_points = 5,
mean_data = 3,
sd_data = 2,
mean_prior = 0,
sd_prior = 0.1,
title = "Increase confidence\nin prior")
#Increase the number of data points
p3 <- explore_senstivity(num_data_points = 50,
mean_data = 3,
sd_data = 2,
mean_prior = 0,
sd_prior = 1.0,
title = "Increase data\n")
#Make likelihood mean closer to the prior
p4 <- explore_senstivity(num_data_points = 5,
mean_data = 0,
sd_data = 2,
mean_prior = 0,
sd_prior = 1.0,
title = "Make likelihood mean\ncloser to prior")
p4 <- explore_senstivity(num_data_points = 1,
mean_data = 10,
sd_data = 1.0,
mean_prior = 10,
sd_prior = 1.0,
title = "Make likelihood mean\ncloser to prior")
```
@fig-post-sens shows the sensitivity of the posterior to assumptions of the prior and the data.
```{r}
#| echo: false
#| fig-cap: Sensitivity of posteriors to different data and prior scenarios
#| label: fig-post-sens
(p1 + p3) / (p2 + p4) + plot_layout(guides = "collect")
```
Just like we extended the likelihood analysis to the non-linear example, we can do the same for the Bayesian analysis.
First, create a data set using the Michaelis-Menten function from the likelihood exercise. Here, instead of fitting both parameters, we only fit one parameter (the maximum or saturating value) called par1. In the chunk below we set the number of data points, the true value for par1, and the standard deviation of the data.
```{r}
num_data_points <- 10
par1_true <- 3
sd_data <- 0.5
x <- runif(num_data_points, 0, 10)
par_true <- c(par1_true, 0.5)
y_true <- par_true[1] * (x / (x + par_true[2]))
y <- rnorm(length(y_true), mean = y_true, sd = sd_data)
```
```{r}
plot(x, y, ylim = c(0, par1_true + 2))
```
Now we can define the prior. We think the prior is normally distributed with a mean and sd defined below
```{r}
mean_prior <- 1.0
sd_prior <- 0.5
```
Here is the manual calculation of the likelihood and the prior. We combine the results into a data frame for visualization.
```{r}
delta_par1 <- 0.1
par1 <- seq(-3,10, delta_par1)
negative_log_likelihood <- rep(NA,length(par1))
for(i in 1:length(par1)){
#Process model
pred <- par1[i] * (x / (x + par_true[2]))
#Data model
negative_log_likelihood[i] <- -sum(dnorm(y, mean = pred, sd = sd_data, log = TRUE))
}
density_likelihood <- exp(-negative_log_likelihood)
likelihood_area_under_curve <- sum(density_likelihood * delta_par1) #Width * Height
density_likelihood_rescaled <- density_likelihood/likelihood_area_under_curve
#Priors
density_prior <- dnorm(par1, mean = mean_prior, sd = sd_prior)
prior_times_likelihood <- density_prior * density_likelihood
area_under_curve <- sum(prior_times_likelihood * delta_par1) #Width * Height
normalized_posterior <- prior_times_likelihood / area_under_curve
```
```{r}
tibble(par1 = par1,
prior = density_prior,
likelihood = density_likelihood_rescaled,
normalized_posterior = normalized_posterior) %>%
pivot_longer(cols = -par1, names_to = "distribution", values_to = "density") %>%
mutate(distribution = factor(distribution)) %>%
ggplot(aes(x = par1, y = density, color = distribution)) +
geom_line() +
theme_bw()
```
Now we can look at how our prior and posterior distributions influence the shape of the process model curve (M-M). For illustration, the figure below shows the M-M curve using the most likely value from the prior, the most likely value if we just looked at the likelihood, and the most likely value from the posterior.
```{r}
par1_mle <- par1[which.max(density_likelihood)]
par1_post <- par1[which.max(normalized_posterior)]
d <- tibble(x = seq(0,10, 0.1),
prior = mean_prior * (x / (x + par_true[2])),
likelihood = par1_mle * (x / (x + par_true[2])),
posterior = par1_post * (x / (x + par_true[2]))) %>%
pivot_longer(cols = -x, names_to = "distribution", values_to = "prediction") %>%
mutate(distribution = factor(distribution))
```
```{r}
ggplot(d, aes(x = x, y = prediction, col = distribution)) +
geom_line() +
labs(y = "M-M model prediction (process model)") +
theme_bw()
```
## Solving a Bayesian model
The example above is designed to build an intuition for how a Bayesian analysis works but is not how the parameters in a Bayesian model are estimated in practice. For one thing, if there are multiple parameters being estimated, it is very hard to estimate the area under the curve. Second, the area is very very small if there are many data points (so small that the computer can't hold it well).
There are two common methods for estimating posterior distributions that involve numerical computation. Both methods involve randomly sampling from an unknown posterior distribution and saving the samples. The sequence of saved samples of the parameters is called a Markov chain Monte Carlo (MCMC). The distribution of parameter values in the MCMC chain is your posterior distribution.
The first is called Gibbs sampling and is used when you know the probability distribution type for the posterior but not the parameter values of the distribution. For example, if your prior is normally distributed and your likelihood is normally distributed then you know (from math that others have already done for you) that the posterior is normally distributed. Therefore, you can randomly draw from that distribution to build your MCMC chain that generates your posterior distribution. We will not go over this in detail but know that this method will give you an answer quicker but requires you (or the software you are using) to know that your prior and likelihood are conjunct (i.e., someone else has worked out that the posterior always has a certain PDF if the prior and likelihood are of certain PDF - see pages 86-89 and Table A3 in Hobbs and Hooten).
The second is called MCMC Metropolis-Hastings (MCMC-MH) and is a rejection sampling method. In this case, you don't know the form of the posterior. Basically, the MCMC-MH is the following
0) Create a vector that has a length equal to the total number of iterations you want to have in your MCMC chain. Set the first value of the vector to your parameter starting point.
1) randomly choose a new parameter value based on the previous parameter value in the MCMC chain (called a proposal). For example (where i is the current iteration in your MCMC chain): `par_proposed <- rnorm(1, mean = par[i - 1], sd = jump)`, where `jump` is the standard deviation that governs how far you want the proposed parameter to potentially be from the previous parameter.
2) use this proposed parameter in your likelihood and prior calculations and multiply the likelihood \* prior (i.e., the numerator in Bayes formula). Save this as the proposed probability(`prob_proposed`)
3) Take the ratio of the proposed probability to the previous probability (also called the current probability; `prob_current`). call this `prob_proposed`
4) Randomly select a number between 0 and 1. Call this `z`
5) If `prob_proposed` from Step 3 is greater than `z` from Step 4 then save the proposed parameter for that iteration of MCMC chain. If it is less, then assign the previous parameter value for that iteration of the MCMC chain. As a result of Step 5:
- All parameters that improve the probability will have `prob_proposed/prob_current` \> 1. Therefore all improvements will be accepted since z (by definition in Step 4) can never be greater than 1.
- Worse parameters where `prob_proposed/prob_current` \< 1, will be accepted in proportion to how worse they are. For example if `prob_proposed/prob_current` = 0.9, 90% of the time z will be less the 0.90 so the worse parameters that are worse by 10% will be accepted 90% of the time. If `prob_proposed/prob_current` = 0.01 (i.e., the new parameters are much worse), only 1% of the time will z be less than 0.01. Therefore it is possible but not common to save these worse parameters. As a result, the MCMC-MH approach explores the full distribution by spending more time at more likely parameter values. This is different than maximum likelihood optimization methods like `optim` that only save parameters that are better than the previous (thus finding the peak of the mountain rather than the shape of the mountain). The MCMC-HM approach requires taking a lot of samples so that it spends some time at very unlikely values - a necessity for estimating the tails of a distribution.
- You want to accept \~40% percent of all proposed parameter values. If your jump parameter from #1 is too large, you won't be able to explore the area around the most likely values (i.e., you won't get a lot of `prob_proposed/prob_current` values near 1). If your jump parameter from #1 is too small, you won't be able to explore the tails of the distribution (i.e., you won't get a lot of `prob_proposed/prob_current` values near 0 that have a random chance of being accepted).
**Note:** There is a step that is ignored here that just confuses at this stage. Your proposal distribution in #1 doesn't have to be normal, which is symmetric (i.e., your probability of jumping from a value of X to Y is the same as jumping from Y to X). There is an adjustment for non-symmetric proposals that is on page 71 in Dietze and page 158 in Hobbs and Hooten.
Here is an example of the MCMC-MH method:
(**Note**: the example below should use logged probability densities for numerical reasons but use the non-logged densities so that the method is clearer. The example in the assignment uses logged densities).
Set up data (same as above)
```{r}
num_data_points <- 10
par1_true <- 3
sd_data <- 0.5
x <- runif(num_data_points, 0, 10)
par_true <- c(par1_true, 0.5)
y_true <- par_true[1] * (x / (x + par_true[2]))
y <- rnorm(length(y_true), mean = y_true, sd = sd_data)
```
```{r}
plot(x, y, ylim = c(0, par1_true + 2))
```
Run MCMC-MH
```{r}
#Initialize chain
num_iter <- 1000
pars <- array(NA, dim = c(num_iter))
pars[1] <- 2
log_prob_current <- -10000000000
prob_current <- exp(log_prob_current)
jump <- 0.8
mean_prior <- 1.0
sd_prior <- 0.5
accept <- rep(NA,num_iter)
accept[1] <- 1
for(i in 2:num_iter){
#Randomly select new parameter values
proposed_pars <- rnorm(1, pars[i - 1], jump)
#PRIORS: how likely is the proposed value given the prior distribution?
prior <- dnorm(proposed_pars, mean = mean_prior, sd = sd_prior)
#PROCESS MODEL: Use new parameter values in the process model
pred <- proposed_pars * (x / (x + par_true[2]))
#DATA MODEL: how likely is the data given the proposed parameter?
#We are multiplying here
likelihood <- prod(dnorm(y, mean = pred, sd = sd_data))
#Combine the prior and likelihood
#remember that you multiply probabilities which means you can add log(probability)
prob_proposed <- prior * likelihood
z <- (prob_proposed/prob_current)
#Now pick a random number between 0 and 1
r <- runif(1, 0, 1)
#If z > r then accept the new parameters
#Note: this will always happen if the new parameters are more likely than
#the old parameters z > 1 means than z is always > r no matter what value of
#r is chosen. However, it will accept worse parameter sets (P_new is less
#likely then P_old - i.e., z < 1) in proportion to how much worse it is
#For example: if z = 0.9 and then any random number drawn by `runif` that is
#less than 0.90 will result in accepting the worse values (i.e., the slightly
#worse values will be accepted a lot of the time). In contrast, if z = 0.01
#(i.e., the new parameters are much much worse), then they can still be accepted
#but much more rarely because random r values of < 0.1 occur more rarely
#print(c(pars[i - 1], proposed_pars, z, r))
if(z > r){
accept[i] <- 1
pars[i] <- proposed_pars
prob_current <- prob_proposed
}else{
accept[i] <- 0
pars[i] <- pars[i - 1]
prob_current <- prob_current #this calculation isn't necessary but is here to show you the logic
}
}
```
The `pars` variable is our MCMC chain estimating the posterior distribution. We can visualize it in two ways. The first is with the iteration number on the x-axis. The second is a histogram. A chain that is ready for analysis will have a constant mean and variance. The variance is important because it is the exploration of the posterior distribution. The histogram shows the posterior distribution.
```{r}
#| warning: FALSE
d <- tibble(iter = 1:num_iter,
par1 = pars)
p1 <- ggplot(d, aes(x = iter, y = par1)) +
geom_line() +
theme_bw()
p2 <- ggplot(d, aes(x = par1)) +
geom_histogram() +
theme_bw()
p1 | p2
```
You should notice that the chain starts at 2 before moving to a mean of 3. The starting value of 2 was arbitrary. Since it was far from 3, the proposed new parameter values often resulted in improvements and accepting more likely values. As a result, the chain moves to the part with a mean of 3 and constant variance (i.e., where the chain has converged). This transition from the starting value to the point where the chain has converged should be discarded. We call this the "burn-in". Here are the same plots with the burn-in removed.
```{r}
#| warning: FALSE
nburn <- 100
d_burn <- tibble(iter = nburn:num_iter,
par1 = pars[nburn:num_iter])
```
```{r}
#| warning: FALSE
p1 <- ggplot(d_burn, aes(x = iter, y = par1)) +
geom_line() +
theme_bw()
p2 <- ggplot(d_burn, aes(x = par1)) +
geom_histogram() +
theme_bw()
p1 | p2
```
Now you can analyze the chain to explore the posterior distribution
```{r}
par_post_burn <- pars[nburn:num_iter]
#Mean
mean(par_post_burn)
#sd
sd(par_post_burn)
#Quantiles
quantile(par_post_burn, c(0.025, 0.5,0.975))
```
Finally, you can sample from the posterior distribution just like you would sample from a random variable using the `rnorm`, `rexp`, `rlnorm`, etc. function. The key is to randomly select an iteration (`num_sample = 1`) or a set of samples (if `num_sample > 0`) with replacement (`replace = TRUE`; i.e., an iteration could be randomly selected multiple times).
```{r}
num_samples <- 100
sample_index <- sample(x = 1:length(par_post_burn), size = num_samples, replace = TRUE)
random_draws <- par_post_burn[sample_index]
```
```{r}
hist(random_draws)
```
## Predictive posterior distributions
Finally, we can use the idea of randomly drawing from the posterior to develop predictions from the posterior. This is just like the Logistic growth module where you randomly sampled from the parameter uncertainty and used the random samples in the logistic equation.
Here we are calculating two things 1) `pred_posterior_mean` just has uncertainty in the M-M model parameter. Think about this as generating uncertainty around the mean prediction at each value of x. 2) `y_posterior` is the prediction of an observation. Therefore, you take the value from #1 and add the uncertainty in the observations from `sd_data` that was set above. This is your predictive or forecast uncertainty.
```{r}
num_samples <- 1000
x_new <- x
pred_posterior_mean <- matrix(NA, num_samples, length(x_new)) # storage for all simulations
y_posterior <- matrix(NA, num_samples, length(x_new))
for(i in 1:num_samples){
sample_index <- sample(x = 1:length(pars), size = 1, replace = TRUE)
pred_posterior_mean[i, ] <- pars[sample_index] * (x_new / (x_new + par_true[2]))
y_posterior[i, ] <- rnorm(length(x_new), pred_posterior_mean[i, ], sd = sd_data)
}
n.stats.y <- apply(y_posterior, 2, quantile, c(0.025, 0.5, 0.975))
n.stats.y.mean <- apply(y_posterior, 2, mean)
n.stats.mean <- apply(pred_posterior_mean, 2, quantile, c(0.025, 0.5, 0.975))
d <- tibble(x = x_new,
median = n.stats.y.mean,
lower95_y = n.stats.y[1, ],
upper95_y = n.stats.y[3, ],
lower95_mean = n.stats.mean[1, ],
upper95_mean = n.stats.mean[3, ],
obs = y)
```
```{r}
ggplot(d, aes(x = x)) +
geom_ribbon(aes(ymin = lower95_y, ymax = upper95_y), fill = "lightblue", alpha = 0.5) +
geom_ribbon(aes(ymin = lower95_mean, ymax = upper95_mean), fill = "pink", alpha = 0.5) +
geom_line(aes(y = median)) +
geom_point(aes(y = obs)) +
labs(y = "M-M Prediction") +
theme_bw()
```
## Important considerations
### Joint vs marginal distributions
If you have multiple parameters in your MCMC chain, you randomly draw `posterior_sample_indices` and use values for all the parameters at that iteration. By doing this you are representing the joint distribution of the parameter - e.g., if parameter 1 is high, parameter 2 is always low. If you select `posterior_sample_indices` for each parameter, you break the correlations of parameters that the MCMC method estimated and, as a result, overestimate the uncertainty.
### Identifiability
Imagine you have a model that predicts water temperature as a linear function of air temperature but uses the following equation:
water temperature = m + a \* b \* air temperature
In this toy example, the slope has been separated into two parameters (a and b). As a result, you can get the same value for a slope with a large value `a` and a small value for `b` as when you have a small value for `a` and a large value for `b`. Consequentially it is impossible to find a value for either a or b and these parameters are unidentifiable. If you collapse the a\*b down into a single parameter (m) then that parameter is identifiable.
In an ecosystem example, we often use net fluxes (like net ecosystem exchange) to calibrate models. Since NEE is the net of respiration and photosynthesis, the same NEE can emerge from a system with large values for respiration and photosynthesis as a system with small values for respiration and photosynthesis. Therefore the parameters that govern respiration and photosynthesis are not identifiable using only NEE data for parameter estimation.
Identifiability issues can be addressed by 1) adding additional data sources for parameter estimation. For example, in the next chapter, we use NEE and LAI in the calibration. The LAI data helps separate the low production from the high production system because a lower production system can not be associated with high LAI values in the model. 2) by using "stronger" priors that have distributions with less spread (see below) or 3) by modifying the equation(s) of the model to reduce the trade-off in parameters.
### Data leakage
Data that contributed to the prior distribution can be me used in the likelihood (data leaking from the prior to posterior). This will result in artificial re-enforcement of the prior and will give artificially large confidence in the posterior. Similarly, data that you collected that will be used in the likelihood can not be used to develop the prior.
You can use data that you will use in your likelihood calculation to involve the starting point of your MCMC chain. In some cases, a maximum likelihood fit of your model, like in @sec-likelihood, will yield starting points that are very close mean of the posterior (just without the uncertainty estimation). By starting your MCMC chain at these maximum likelihood estimates for the parameters, you will reduce the burn-in iterations and increase the computational efficiency of your Bayesian analysis. This is not data leakage.
## Priors
Informed, strong vs. uninformed, vague, weak
The parameters of the distribution determine how informed the prior is. For example, a normal distribution with a mean = 0 and sd = 0.1 is much more informed than one with the sd = 1000 (the latter has similar probability densities across a large range of parameter values). It is also important to consider the range of support in a distribution. For example, a normal distribution has non-zero densities for parameter values from -infinity to +infinity. Therefore the prior density for a parameter value will never be 0 (thus resulting in the prior \* likelihood = 0, and the MCMC never accepting that parameter value). In contrast, a uniform distribution has hard bounds that are user-provided. If the prior is uniform from 0 to 1, then no parameter values less than 0 or greater than 1 will be considered even if the data supports these values. Other probability distributions have embedded bounds. For example, the log-normal distribution can have values less than 0. Overall, hard bounds can be desired because of biophysical constraints on the parameter (e.g., a rate parameter or a proportion parameter can't be negative).
In some cases, they are so informed that you assume they are a fixed value. Justification for the parameter value choice when fixing a parameter or using a strong prior.
Remember that the priors are for specific parameters in the model. The information used to develop informed priors must come from comparable information about the same parameter.
- Literature review
- Databases
- Expert options/survey
### Prior key points
You may want to consider informed priors if
- There are known issues with parameter identifiability in your model
- The parameter is a physical constant (like gravity constant)
- There is considerable prior research on a parameter
- There are physical constraints to a parameter (e.g., it can't be a negative value)
You may consider using informed priors if there is a history of using a set of parameter values in previous applications of the model. In this case, be aware of zombie parameters, that is parameters whose values appear from the literature to be known because everyone uses it, but are in fact a relic of a modeler's decision in the past that has carried through many model applications without being closely examined. The parameter may be less certain than the literature indicates.
You may want to consider uninformed priors if
- You need a prior for the parameter but do not have any information about it.
- There is a desire to have the analysis reflect the data only
## Additional Bayesian topics
This chapter is an introduction to the concepts and numerical techniques used in Bayesian analysis. There are more advanced topics that may be of interest but are not covered here. These include:
- State-space modeling: state-space models are used to estimate process uncertainty for dynamic models like the forest model in @sec-process-model. When estimating process uncertainty using a state-space model, the uncertainty represents the uncertainty that accumulates between time steps in the model.
- Hierarchical models: Hierarchical models are used when you want a parameter to vary over time and space but don't want to estimate unique parameters for every space and time point. Hierarchical models allow for you to estimate a "global" parameter distribution whereby the individual parameters at a site or a point in time are samples from the global distribution. Hierarchical models allow information from one site to help inform parameter values for another site while considering them independent.
- Packages for posterior estimation: Many packages exist to estimate posteriors in Bayesian models that differ in their algorithms, capacities for complex applications, and available documentation. These include [Stan](https://cran.r-project.org/web/packages/rstan/vignettes/rstan.html), [JAGS](https://www.r-bloggers.com/2012/04/getting-started-with-jags-rjags-and-bayesian-modelling/), [Nimble](https://r-nimble.org), and [greta](https://greta-stats.org). All packages can solve simple Bayesian problems like the M-M example above. Packages have different strengths for solving more complex problems. Some are also more well-suited for use with custom functions and dynamic models like the forest model in this book.
- More robust algorithms for estimating posteriors and the use of conjecture priors to speed posterior estimation.
## Reading
## Problem Set
The problem set is located in `bayesian_problem_set.qmd` in https://github.com/frec-5174/book-problem-sets. You can fork the repository or copy the [code](https://github.com/frec-5174/book-problem-sets/blob/main/bayesian_problem_set.qmd) into your own quarto document