---
title: "Carbon Forecast"
author: "Flux Course 2023"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(compiler)
library(tidyverse)
library(mvtnorm)
library(EML)
#remotes::install_github("eco4cast/EFIstandards")
#library(EFIstandards)
source("R/functions.R")
source("R/utils.R")
## configuration
#### SET THE ENSEMBLE SIZE
ne = 500 ## production run should be 200 - 5000, depending on what your computer can handle
timestep = 1800 #seconds
start_date = as.Date("2015-07-01")
horiz_calib = 1 #day forecast horizon during calibration
horiz = 35 #days, forecast horizon during forecast
outdir = "./" ## output directory for saving files
```
# Why Forecast
* Accelerate Science
* Improve Decision Making
# What: NEON Forecast Challenge Flux Forecast
* 35 days ahead (NOAA GEFS ensemble weather forecast)
* subdaily timestep
* NEON fluxes updated daily (5 day latency)
* Able to submit daily
More info is available at https://ecoforecast.org/efi-rcn-forecast-challenges/
# How: Forecast Analysis Cycle
* Forecast step -> Uncertainty propagation
* Analysis step -> Bayes' theorem
If you liked the Forecast-Analysis Cycle you may also like...
* Dietze M. 2017. Ecological Forecasting. Princeton University Press
* Dietze et al 2018. Iterative near-term ecological forecasting: Needs, opportunities, and challenges. PNAS 115 (7) 1424-1432 https://doi.org/10.1073/pnas.1710231115
# Super Simple Ecosystem Model
Let's begin by defining our model itself, as well as a number of ancillary
functions that will be useful in simulation and analysis. The model below is
very simple but is complex enough to have some chance at capturing observed
variability. In addition, unlike most ecosystem models, it explicitly contains
process error. The model has three state variables (X) that are all expressed in
terms of carbon (Mg/ha): Leaf Biomass, Structural Biomass (wood, roots, etc),
and soil organic matter (SOM). The model also has only two drivers:
photosynthetically active radiation (PAR), and air temperature. Within the model we first
estimate LAI from Leaf Biomass and SLA. Using LAI and light we estimate GPP
using a simple light use efficiency approach. GPP is then allocated to
autotrophic respiration (Ra), leaf NPP, and woody NPP. The leaf and wood
biomass pools can then turn over into SOM via litterfall and coarse woody debris / mortality.
Heterotrophic respiration is assumed to follow a standard Q10 temperature
sensitivity. Finally, normally distributed process error is added to each state variable in X.
```{r}
SSEM.orig
```
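The chunk above prints the actual implementation (`SSEM.orig`, sourced from R/functions.R). As a study aid, here is a minimal sketch of one timestep that mirrors the description; the functional forms and names are assumptions (e.g., the Beer's-law light term, and `falloc.2`/`falloc.3` as the wood and leaf allocation columns), not the authoritative model.
```{r}
## SKETCH ONLY: one timestep of an SSEM-like model, assuming the structure
## described above. X = [Bleaf, Bwood, SOM] in Mg C/ha; PAR in umol/m2/s;
## temp in Celsius; params is a one-row draw from the parameter data frame.
ssem_step_sketch <- function(X, params, PAR, temp, dt = 1800){
  k    <- 1e-6 * 12 * 1e-6 * 1e4 * dt                   ## umol CO2 m-2 s-1 -> Mg C ha-1 per timestep
  LAI  <- X[1] * 0.1 * params$SLA                       ## Mg/ha -> kg/m2, times SLA (m2/kg)
  GPP  <- params$alpha * PAR * (1 - exp(-0.5 * LAI))    ## light-use-efficiency GPP (umol/m2/s); Beer's-law term assumed
  Rh   <- params$Rbasal * params$Q10^(temp / 10) * X[3] ## Q10 heterotrophic respiration (umol/m2/s)
  litter <- X[1] * params$litterfall                    ## leaf turnover (fraction per timestep)
  CWD    <- X[2] * params$mortality                     ## wood mortality (fraction per timestep)
  ## update states with Normal process error; falloc columns are [Ra, wood, leaf]
  c(rnorm(1, X[1] + GPP * k * params$falloc.3 - litter, params$sigma.leaf),
    rnorm(1, X[2] + GPP * k * params$falloc.2 - CWD,    params$sigma.stem),
    rnorm(1, X[3] + litter + CWD - Rh * k,              params$sigma.soil))
}
```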
# Initial Conditions
Having defined our model, the next step is to define the ensemble size and
generate an ensemble estimate of the initial state variables. To do so we'll use
the estimates that are reported in the Ameriflux BADM Meta-data files for the
site. Since we're only relying on two different estimates of pool size to
calculate our mean and standard deviation, and neither estimate has a reported
error, these should be taken as "demonstration only" rather than as
"best practices". In a real application one would want to account for the
sampling error associated with the number of vegetation plots or soil cores
measured, the measurement error in the soil C and tree DBH, and the allometric
uncertainty in converting from DBH to leaf and stem biomass. In other words, our
pool sizes are likely a lot less certain than what we take them to be in this
exercise.
There are a few assumptions in developing the initial conditions:
- The mean and standard deviation are only calculated using two observations.
As stated above, this is not best practice.
- The wood state is the combination of stems, coarse roots, and fine roots
- The SOM (soil) state is the combination of litter, downed coarse woody debris,
and soil.
- All values are converted from g/m2 to Mg/ha
```{r}
### Initial State (Mg/ha)
### These specific data points were extracted from the US-NR1 Ameriflux BADM files
### https://ameriflux.lbl.gov/sites/siteinfo/US-NR1
### METADATA: https://ameriflux.lbl.gov/data/badm/badm-standards/
### data is also available at https://bit.ly/3POkhq9
#library(readxl)
#badm = readxl::read_xlsx(file.path("AMF_AA-Flx_FLUXNET-BIF_CCBY4_20220606.xlsx")) %>% filter(SITE_ID == "US-NR1")
#write_csv(x=badm,file="data/US-NR1_BADM.csv")
#Bwood = badm %>% filter(VARIABLE == "AG_BIOMASS_TREE") %>% select(DATAVALUE) %>% type_convert() ## unsuccessful crack at automating extraction
Bwood = 14500 * 1e-6 * 10000 ## convert g/m2 -> Mg/ha
Bleaf = 2950 * 0.01 ## generates a LAI that is too high ***
SOM = (c(1.57, 1.58) + c(0.49, 1.39) + c(2.06, 2.59)) * 1e-3 * 10000 ## sum up litter, CWD, and soil, then change units (US-ME2)
X = as.matrix(c(mean(Bleaf), mean(Bwood), mean(SOM)))
### sample initial condition ensemble members
if(ne > 1){
  X = as.matrix(cbind(
    rnorm(ne, X[1], Bleaf*0.1), ## no variability in data, assume 10%
    rnorm(ne, X[2], Bwood*0.1), ## no variability in data, assume 10%
    rnorm(ne, X[3], sd(SOM))))
}
X.orig = X ## make a copy so that we can run different experiments
## visualize initial condition priors
pool.lab = c("leaf", "wood", "SOM")
for(i in 1:3){
  hist(X[, i], main = pool.lab[i])
}
```
# Drivers
```{r}
## data downloaded from FLUXNET2015 "fullset" product, subset to 2015 for this analysis
# flux <- read_csv("AMF_US-NR1_FLUXNET_FULLSET_HH_1998-2016_3-5.csv")
# flux <- flux %>% filter(date > as.Date("2015-01-01"))
# write_csv(x=flux,file= "data/US-NR1_Flux2015.csv")
flux <- read_csv("data/US-NR1_Flux2015.csv")
date = strptime(flux$TIMESTAMP_START,format="%Y%m%d%H%M")
inputs <- data.frame(
  date = date,
  temp = flux$TA_ERA, ## gap-filled air temperature (Celsius)
  PAR = flux$SW_IN_ERA / 0.486 ## gap-filled PAR; conversion from Campbell and Norman p151
)
plot(inputs$date,inputs$PAR,type='l')
plot(inputs$date,inputs$temp,type='l')
```
# Parameters: Priors & Direct constraints (trait data)
```{r}
## ancillary data from Ameriflux BADM metadata
SLA = 1e3/c(193.7,205.1,237.7) ## m2/kg
litterfall = c(215.8)*0.01*3 ## gC/m2/yr->Mg/ha/yr
### initial params
params = list()
## univariate priors: expert opinion
params$alpha = rlnorm(ne, log(0.02), 0.05) ## light use efficiency
params$Q10 = rnorm(ne, 2.1, 0.1) ## soil respiration Q10
params$Rbasal = rlnorm(ne, log(0.2), 1) / (params$Q10^2.5) ## Soil basal respiration (umol/m2/sec per Mg/ha of SOM)
## multivariate prior on allocation parameters
## assume that NPP is ~50% of GPP on average (Litton et al 2007)
Ra = 0.5
## prior mean on allocation, assume leaf NPP is 31.5% of total (Quaife et al 2008)
alloc = matrix(c(Ra, (1 - 0.315) * (1 - Ra), 0.315 * (1 - Ra)), 1)
## draw effective sample size to add stochasticity to prior
Neff = matrix(rpois(ne, 100), ne)
rdirichlet <- cmpfun(rdirichlet.orig) ## byte compile to speed up
## prior on fractional allocation to [Ra, wood, leaf] (each draw must sum to 1)
params$falloc = rdirichlet(ne,Neff%*%alloc)
## Process error: expert opinion
## convert gamma priors on precision into standard deviations
params$sigma.leaf = 1 / sqrt(rgamma(ne, 10, 10 * 0.01 ^ 2)) ## leaf biomass
params$sigma.stem = 1 / sqrt(rgamma(ne, 10, 10 * 0.01 ^ 2)) ## wood biomass
params$sigma.soil = 1 / sqrt(rgamma(ne, 10, 10 * 0.01 ^ 2)) ## soil carbon
## Specific leaf area
params$SLA = rnorm(ne, mean(SLA), sd(SLA))
## simulate litterfall turnover based on observed
## litterfall rate and Bleaf prior (initial condition)
if(length(litterfall) > 1){
  sdlitterfall = sd(litterfall)
} else {
  sdlitterfall = 0.1*litterfall ## wild assumption
}
lit = rnorm(10000,mean(litterfall),sdlitterfall/sqrt(2))/
  rnorm(10000,mean(X[,1]),sd(X[,1])/sqrt(2)) ## X1 = leaf biomass
## draw prior mean and sd; convert turnover per year -> turnover per timestep
lit.mu = rnorm(ne,mean(lit),sd(lit))*timestep/86400/365
lit.sd = 1/sqrt(rgamma(ne,10,10*var(lit)))*timestep/86400/365
litterfall.param = beta.match(lit.mu,lit.sd^2)
## match moments and draw litterfall prior
params$litterfall = rbeta(ne,litterfall.param$a,litterfall.param$b)
## draw prior mean based on background tree mortality rate of 1/142 per year (Dietze et al 2011)
mortality.mu = 1/rpois(ne,142)*timestep/86400/365
## draw prior sd assuming a 50% CV
mortality.sd = rbeta(ne,4,4)*mortality.mu ## ~50% CV; mortality.mu is already per-timestep, so no second unit conversion
## match moments and draw mortality prior
mortality.param = beta.match(mortality.mu,mortality.sd^2)
params$mortality = rbeta(ne,mortality.param$a,mortality.param$b)
## flatten to a data frame
params = as.data.frame(params)
```
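The turnover priors above use `beta.match()` from R/utils.R to convert a mean and variance into Beta shape parameters. Assuming it performs standard moment matching, the calculation looks like this:
```{r}
## SKETCH ONLY: standard Beta moment matching, assumed to be what
## beta.match() (R/utils.R) implements. For mean mu and variance v,
## a = mu*s and b = (1-mu)*s where s = mu*(1-mu)/v - 1.
beta.match.sketch <- function(mu, var){
  s <- mu * (1 - mu) / var - 1   ## plays the role of an effective sample size
  list(a = mu * s, b = (1 - mu) * s)
}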
### plot parameter values
```{r, fig.asp=1}
plot_params(params)
```
# Forecast Step: Monte Carlo Error Propagation
*Ensemble forecast function*
```{r}
ensemble_forecast
```
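`ensemble_forecast` (printed above, from R/functions.R) implements Monte Carlo error propagation: every ensemble member is run forward with its own initial conditions and its own parameter draw, so uncertainty is carried by the spread of the ensemble rather than by any analytical approximation. A minimal sketch of that idea, reusing `ssem_step_sketch` from earlier (the output layout is illustrative; the real function also tracks flux diagnostics such as GPP, Ra, Rh, and NEP):
```{r}
## SKETCH ONLY: Monte Carlo propagation of an ensemble through time.
ensemble_forecast_sketch <- function(X, params, inputs){
  nt  <- nrow(inputs)
  out <- array(NA, c(nt, nrow(X), ncol(X)))   ## [time, ensemble, state]
  for(t in seq_len(nt)){
    for(i in seq_len(nrow(X))){               ## each member keeps its own parameter draw
      X[i, ] <- ssem_step_sketch(X[i, ], params[i, ], inputs$PAR[t], inputs$temp[t])
    }
    out[t, , ] <- X
  }
  out
}
```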
*Run first forecast*
```{r}
dayInputs = inputs %>% dplyr::filter(date >= start_date, date < start_date + lubridate::days(horiz_calib))
output.ensemble = ensemble_forecast(X = X.orig,
                                    params = params,
                                    inputs = dayInputs[,2:3]) ## don't feed the date column to the model
```
*Plot timeseries*
```{r}
plot_forecast(output.ensemble,sample=5)
```
If you like Ensemble Forecasting, you may also like...
* Alternative approaches to error propagation
* Analytical distributional transformations
* Analytical Moments
* Linear Tangent approximations
* Uncertainty partitioning
* Predictability analyses (e.g. error vs lead time)
* Dietze 2017 Prediction in ecology: a first-principles framework. Ecological Applications 27: 2048–2060. DOI: 10.1002/eap.1589
# Assessment
At the half-hourly or hourly resolution, the _QC variable indicates whether the corresponding record is a measured value (_QC = 0) or a gap-filled value, with higher values indicating lower gap-filling quality (_QC = 1 best, _QC = 3 worst).
```{r}
## model
ci = apply(output.ensemble[, ,6], 1, quantile, c(0.025, 0.5, 0.975)) ## forecast confidence interval
## variable 6 = NEP
## observations
today = which(date >= start_date & date < start_date + horiz_calib)
nep = -flux$NEE_VUT_REF[today] ## change sign convention
nep.qc = flux$NEE_VUT_REF_QC[today]
nep.unc = flux$NEE_VUT_REF_JOINTUNC[today]
## plot timeseries
ylim = range(c(range(ci),range(nep,na.rm=TRUE)))
plot(dayInputs$date,ci[2,],ylim=ylim,col=1,pch=18)
ciEnvelope(dayInputs$date, ci[1, ], ci[3, ], col = col.alpha("lightGrey", 0.5))
points(dayInputs$date,nep,col=nep.qc+1)
legend("topright",legend=c("model",paste("QC",0:2)),
pch = c(18,rep(1,3)),
col= c(1,1:3))
## predicted/observed plot
plot(ci[2,],nep,col=nep.qc+1)
abline(0,1,lty=2)
```
If you liked model assessment, you might also like...
* Model selection (AIC, DIC, WAIC, predictive loss)
* Model averaging
* Quantile-based model diagnostics & Continuous Rank Probability Score (CRPS)
# Analysis Step: Particle Filter
* What is a Likelihood?
* What is Bayes Theorem?
* How does it allow us to update model parameters and states?
* How does a Particle Filter work?
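In a particle filter, each ensemble member is a sample ("particle") from the forecast distribution, and Bayes' theorem is applied by reweighting: each member's weight is multiplied by the likelihood of the new observations given that member's predictions,

$$w_{i,t} \propto w_{i,t-1}\, p(y_t \mid x_{i,t})$$

so members that predict the data well gain influence in the posterior while members that don't fade away. When the weights become too uneven, the ensemble is resampled.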
```{r}
ParticleFilter
```
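`ParticleFilter` (printed above, from R/functions.R) is the authoritative implementation. As a sketch of the core idea, assuming the Gaussian NEP likelihood described above (resampling of the states and parameters themselves is simplified away here):
```{r}
## SKETCH ONLY: one particle filter analysis step.
## nep.pred: [time, ensemble] matrix of predicted NEP; dat: observed nep and
## sigma.nep; wt: each member's prior weight.
particle_filter_sketch <- function(nep.pred, dat, wt){
  for(t in which(!is.na(dat$nep))){        ## skip missing/gap-filled records
    wt <- wt * dnorm(dat$nep[t], nep.pred[t, ], dat$sigma.nep[t]) ## Bayes: weight x likelihood
  }
  wtn  <- wt / sum(wt)
  Neff <- 1 / sum(wtn^2)                   ## effective sample size
  if(Neff < length(wt) / 2){               ## resample if weights degenerate
    index <- sample(seq_along(wt), length(wt), replace = TRUE, prob = wtn)
    return(list(index = index, wt = rep(1, length(wt)))) ## weights reset to uniform
  }
  list(index = seq_along(wt), wt = wt)
}
```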
```{r}
Analysis = list() ## storage for saving PF output
## data constraints, day 1
nep[nep.qc>0] = NA ## only use true observations, not gap-filled values
dat = data.frame(nep = nep,
                 sigma.nep = nep.unc)
## run PF
Analysis[[1]] = ParticleFilter(output.ensemble,params,dat,wt = rep(1,ne))
# Save the Analysis
dir.create("demo_analysis",showWarnings = FALSE)
saveRDS(Analysis[[1]],file=file.path("demo_analysis",paste0(start_date,".RDS")))
```
If you liked the Particle Filter, you may also like...
* Markov Chain Monte Carlo (MCMC)
* JAGS, NIMBLE
* R's BayesianTools
* Hamiltonian Monte Carlo & Stan
* State Space Models
* Kalman Filter
* Ensemble Kalman Filter
* Using uncertainty analysis (Sobol, Monte Carlo, etc) to reduce the number of parameters considered
# Workflow
Now let's put the Forecast and Analysis steps into an iterative cycle so that we can calibrate the model
```{r}
ndays = 30 ## number of days to run the forecast
t0 = which(date == start_date) ## row number of time 0
forecast <- array(NA,c(86400/timestep*(ndays+1),ne,12)) ## output storage [time, ensemble, state]
forecast[1:dim(output.ensemble)[1],,] = output.ensemble ## store first forecast
for(t in 1:ndays){
  today = which(date >= (start_date + t) & date < (start_date + horiz_calib + t))
  # Today's forecast
  out = ensemble_forecast(X = Analysis[[t]]$X,           ## initial conditions = yesterday's Analysis
                          params = Analysis[[t]]$params, ## today's parameters = yesterday's Analysis
                          inputs = inputs[today,2:3])    ## today's subset of meteorology
  forecast[today-t0+1,,] = out ## store today's forecast in overall array
  # Today's data constraints
  nep = -flux$NEE_VUT_REF[today]
  nep.qc = flux$NEE_VUT_REF_QC[today]
  nep[nep.qc>0] = NA
  dat = data.frame(nep = nep,
                   sigma.nep = flux$NEE_VUT_REF_JOINTUNC[today])
  # Today's analysis
  Analysis[[t+1]] = ParticleFilter(out, Analysis[[t]]$params, dat, wt = Analysis[[t]]$wt)
  # Save the Analysis
  saveRDS(Analysis[[t+1]], file = file.path("demo_analysis", paste0(start_date+t+1, ".RDS")))
  # counter to help us know things are still running
  print(t)
}
# Save the Analysis List
saveRDS(Analysis,file="AnalysisList.RDS")
## effective sample size over time
weights = sapply(Analysis, function(x){x$wt})
Neff = apply(weights, 2, function(x){
  wtn = x/sum(x)
  return(1/sum(wtn^2))
})
plot(Neff, type='l')
abline(v = which(Neff > 0.999*ne), lty = 3) ## mark analyses where weights are ~uniform (i.e., just after a resampling)
```
# Forecast Visualizations
Timeseries for all variables
```{r}
plot_forecast(forecast)
```
Validation against NEP
```{r}
## Timeseries
days = which(date >= start_date & date < start_date + horiz_calib + ndays)
nep = -flux$NEE_VUT_REF[days]
nep.qc = flux$NEE_VUT_REF_QC[days]
ci = apply(forecast[, ,6], 1, quantile, c(0.025, 0.5, 0.975))
ylim = range(c(range(ci),range(nep,na.rm=TRUE)))
plot(date[days],ci[2,],ylim=ylim,col=1, pch=18,ylab="NEP",type="b")
#ciEnvelope(date[days], ci[1, ], ci[3, ], col = col.alpha("lightGrey", 0.5))
points(date[days],nep,col=nep.qc+1)
legend("topright",legend=c("model",paste("QC",0:2)),
pch = c(18,rep(1,3)),
col= c(1,1:3))
## predicted/observed
plot(ci[2,],nep,col=nep.qc+1,xlab="model")
abline(0,1,lty=2)
```
Compare initial and final parameters
```{r,fig.asp=1}
plot_params(Analysis[[t+1]]$params,params)
```
How did parameters evolve over time?
```{r}
## parameter timeseries
for(i in 1:12){
  p = sapply(Analysis, function(x){
    wtd.quantile(x$params[,i], x$wt, c(0.025,0.5,0.975))
  })
  time = start_date + 0:ndays
  plot(time, p[2,], ylim = range(p), ylab = names(params)[i])
  ciEnvelope(time, p[1,], p[3,], col = col.alpha("lightblue",0.5))
}
```
# EFI Standard: output and metadata
To be able to submit the forecast to the NEON challenge, we need to reorganize the output into the community standard format (a long data frame) and generate metadata describing the forecast.
```{r}
## reorganize data in long format
dimnames(output.ensemble) <- list(as.character(dayInputs$date), as.character(1:ne), varnames) ## label array dimensions; dates must match this (first-day) forecast
fx = as.data.frame.table(output.ensemble) ## reorganize into long format
colnames(fx) = c("datetime","ensemble","variable","prediction") ## label columns
fx_file = file.path(outdir,paste0("terrestrial_30min-",start_date,"-SSEM.csv")) ## output filename
write_csv(fx,fx_file)
head(fx)
## metadata
## define variable names, units, etc
attributes <- tibble::tribble(
  ~attributeName, ~attributeDefinition,                    ~unit,           ~formatString,         ~numberType,  ~definition,
  "time",         "[dimension]{time}",                     "year",          "YYYY-MM-DD HH:MM:SS", "numberType", NA,
  "ensemble",     "[dimension]{index of ensemble member}", "dimensionless", NA,                    "integer",    NA,
)
variables = cbind(varnames, rep("[variable]",12), units, rep(NA,12), rep("real",12), rep(NA,12))
colnames(variables) = colnames(attributes)
attributes <- rbind(attributes, variables)
attributes
attrList <- set_attributes(attributes,
                           col_classes = c("Date", "numeric", rep("numeric", nrow(variables))))
## sets metadata about the file itself (name, file type, size, MD5, etc)
physical <- set_physical(fx_file, recordDelimiter='\n')
## set metadata for the file as a whole
dataTable <- eml$dataTable(
  entityName = "forecast", ## this is a standard name
  entityDescription = "Carbon cycle forecast for US-NR1",
  physical = physical,
  attributeList = attrList)
me <- list(individualName = list(givenName = "Mike",
                                 surName = "Dietze"),
           electronicMailAddress = "[email protected]",
           id = "https://orcid.org/0000-0002-2324-2518")
coverage <- set_coverage(begin = first(date[days]),
                         end = last(date[days]),
                         geographicDescription = "Niwot Ridge, CO, USA",
                         west = -105.5464, east = -105.5464,
                         north = 40.0329, south = 40.0329)
dataset = eml$dataset(
  title = "A simple carbon cycle forecast",
  creator = me,
  contact = list(references = "https://orcid.org/0000-0002-2324-2518"),
  pubDate = start_date,
  intellectualRights = "",
  abstract = "An illustration of how we might use a particle filter to constrain a simple carbon forecast",
  dataTable = dataTable,
  coverage = coverage
)
## EFI specific forecast metadata
additionalMetadata <- eml$additionalMetadata(
  metadata = list(
    forecast = list(
      ## Basic elements
      timestep = paste(timestep,"seconds"),
      forecast_horizon = paste(horiz_calib,"days"), ## this file holds the one-day calibration forecast
      start_time = start_date,
      iteration_id = Sys.time(),
      project_id = "Flux Course 2023",
      metadata_standard_version = "0.4",
      model_description = list(
        model_id = "SSEM",
        name = "Super Simple Ecosystem Model",
        type = "process-based",
        repository = "https://github.com/ecoforecast/EFActivites"
      ),
      ## MODEL STRUCTURE & UNCERTAINTY CLASSES
      initial_conditions = list(
        # Possible values: absent, present, data_driven, propagates, assimilates
        status = "assimilates",
        # Number of parameters / dimensionality
        complexity = 3
      ),
      drivers = list(
        status = "data_driven",
        complexity = 2
      ),
      parameters = list(
        status = "assimilates",
        complexity = 9
      ),
      random_effects = list(
        status = "absent"
      ),
      process_error = list(
        status = "assimilates",
        propagation = list(
          type = "ensemble", # ensemble vs analytic
          size = ne          # required if ensemble
        ),
        complexity = 3,
        covariance = FALSE
      ),
      obs_error = list(
        status = "data_driven",
        complexity = 1,
        covariance = FALSE
      )
    ) # forecast
  ) # metadata
) # eml$additionalMetadata
my_eml <- eml$eml(dataset = dataset,
                  additionalMetadata = additionalMetadata,
                  packageId = Sys.time(),
                  system = "datetime" ## system used to generate packageId
)
## check that the EML is also a valid EFI forecast
#EFIstandards::forecast_validator(my_eml)
meta_file = file.path(outdir,paste0("terrestrial_30min_",start_date,"_SSEM.xml"))
write_eml(my_eml, meta_file)
```
# Submission
An example of how you would submit the forecast to the EFI NEON challenge (by default, this chunk doesn't actually run):
```
neon4cast::submit(forecast_file = fx_file, metadata = NULL, ask = FALSE)
```
# Run an actual forecast!
* The script forecast.R runs the forecast into the future using forecast meteorology (EFI provides the NOAA GEFS 35-day, 31-member ensemble)
* To set up, copy the date-stamped Analysis RDS file that you want to start from (the state and parameter initial conditions) into the `analysis` folder and change the date to the date you want to start forecasting from
* Met variables are now matrices of ensemble members that need to be sampled over (see the sketch after this list)
* Every day we predict 35 days into the future, so we end up forecasting every date 35 times (with different lead times) rather than just once as we did above.
* Because of the latency in the NEON data we also end up "reforecasting" the past few days to catch and assimilate any new observations that may have shown up since our last forecast
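A minimal sketch of that met sampling, assuming hypothetical `temp.ens` and `PAR.ens` matrices with one column per GEFS ensemble member (the working version of this logic lives in forecast.R):
```{r}
## SKETCH ONLY: pair each model ensemble member with a randomly drawn NOAA
## GEFS met member. temp.ens and PAR.ens are assumed [time x 31] matrices.
sample_met <- function(temp.ens, PAR.ens, ne){
  m <- sample(ncol(PAR.ens), ne, replace = TRUE)  ## one GEFS member per model member
  lapply(seq_len(ne), function(i){
    data.frame(temp = temp.ens[, m[i]], PAR = PAR.ens[, m[i]])
  })
}
```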
# Workflow automation
The first part of this file (`on:`) uses cron to schedule the workflow. The forecast in this repository is designed to run daily at 20:00 UTC. The execution of the forecast occurs on GitHub's servers, so your local computer does not need to be turned on. In `.github/workflows/do_prediction.yml`, the line `- cron: "0 20 * * *"` defines the time that the forecast is run, in this case each day at 20:00:00 UTC (note that all GitHub timings are in UTC). You can update this to run on a different schedule based on the timing codes found at https://crontab.guru
The second part of this file (`jobs:`) grabs a copy of a Docker container image, eco4cast/rocker-neon4cast, that has R and a large number of R packages pre-installed, including the NEON forecast challenge packages.
The final part of this file (`run:`) tells GitHub Actions which R script it should run.
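Putting those pieces together, a sketch of what `.github/workflows/do_prediction.yml` might look like (the job and step names here are illustrative, not copied from the repository):
```
on:
  workflow_dispatch:
  schedule:
    - cron: "0 20 * * *"   # every day at 20:00 UTC
jobs:
  forecast:
    runs-on: ubuntu-latest
    container: eco4cast/rocker-neon4cast   # R + NEON forecast challenge packages pre-installed
    steps:
      - uses: actions/checkout@v3
      - name: Run the forecast
        run: Rscript forecast.R
```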
* Automate to run every day!
See https://github.com/eco4cast/neon4cast-example for an example. To do so, you'd want to save the calibration we've done here and set up an R script that just runs the Forecast-Analysis cycle for a specific start date.
A video providing more information about how to use GitHub Actions for automated forecast generation can be found here: https://youtu.be/dMrUlXi4_Bo
# Next steps
* Additional sites: EFI provides met and targets for all 47 NEON tower sites. Extending beyond these is possible but requires nontrivial time spent on data processing.
* Additional data constraints to the Particle Filter's Likelihood: Things that work particularly well are sensor-based observations that become available frequently (e.g., MODIS LAI, SMAP, soil respiration)
* Modify/improve the model: Lots of assumptions in SSEM are known to be false!
* Longer calibration (do parameters vary by season? by year?)