title | author | date | layout
---|---|---|---
Linear regression on a spatial grid | Stefan Siegert | October 2017 | default
We address a commonly encountered problem in weather and climate forecasting: a data set of past atmospheric model forecasts and corresponding observations is available for a few years and many grid points. We want to use this data set to learn something about systematic errors of the forecast model, and about how to correct them in future forecasts. Correcting systematic forecast errors is often done by fitting a linear regression model to the observations that uses the model forecast as a predictor. We expect systematic errors to differ between locations, so in principle we should fit a regression model separately at each grid point. But atmospheric data is spatially smooth, so we would also expect the regression parameters to vary smoothly in space.
In this note I develop a method to efficiently fit linear regression models to gridded data, while accounting for spatial smoothness of the parameter estimates. The method results in improved out-of-sample performance of the regression model.
suppressPackageStartupMessages(library(rnaturalearth))
suppressPackageStartupMessages(library(maps))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(Matrix))
suppressPackageStartupMessages(library(viridis))
knitr::opts_chunk$set(
cache.path='_knitr_cache/grid-regression/',
fig.path='figure/grid-regression/'
)
borders = ne_countries(scale=110, continent='Europe') %>% map_data
Our data set consists of climate model forecasts and corresponding observations of surface temperature over a region in central Europe in summer (June-August). There are 17 years' worth of forecasts and observations (1993-2009), with one forecast per grid point per year. The climate model forecasts were initialised in early May of the same year, so the forecast lead time is on seasonal time scales. The forecast and observation data were normalised to have mean zero and variance one.
load('data/grid-regression.Rdata') # available in the repository
# grid parameters
N_i = t2m %>% distinct(lat) %>% nrow
N_j = t2m %>% distinct(lon) %>% nrow
N_t = t2m %>% distinct(year) %>% nrow
N_s = N_i * N_j
# lat/lon range
lat_range = t2m %>% select(lat) %>% range
lon_range = t2m %>% select(lon) %>% range
# prepare the temperature data frame
t2m = t2m %>%
spread('type', 'temp') %>%
mutate(i = as.numeric(factor(lat, labels=1:N_i))) %>%
mutate(j = as.numeric(factor(lon, labels=1:N_j))) %>%
mutate(s = (j-1) * N_i + i)
t2m
## # A tibble: 4,641 x 8
## year lat lon fcst obs i j s
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 45.35 5.62 -0.5723529 -0.65581699 1 1 1
## 2 1 45.35 6.33 -0.5403007 -0.28856209 1 2 14
## 3 1 45.35 7.03 -0.4745686 -0.08392157 1 3 27
## 4 1 45.35 7.73 -0.5454510 -0.12503268 1 4 40
## 5 1 45.35 8.44 -0.6720850 -0.16267974 1 5 53
## 6 1 45.35 9.14 -0.6268562 -0.03320261 1 6 66
## 7 1 45.35 9.84 -0.5265686 0.27477124 1 7 79
## 8 1 45.35 10.55 -0.5040850 0.35745098 1 8 92
## 9 1 45.35 11.25 -0.4779935 0.11647059 1 9 105
## 10 1 45.35 11.95 -0.4878824 -0.06941176 1 10 118
## # ... with 4,631 more rows
Below we plot the observations for the first year of the data set:
ggplot(t2m %>% filter(year==1)) +
geom_raster(aes(x=lon, y=lat, fill=obs)) +
coord_cartesian(xlim = lon_range, ylim=lat_range) +
geom_path(data=borders, aes(x=long, y=lat, group=group), col='white')
We have data on a regular grid with $N_s = N_i \times N_j$ spatial grid points, indexed by $s = 1, \dots, N_s$, and $N_t$ time points, indexed by $t = 1, \dots, N_t$. At each grid point $s$ we relate the observation $y_{s,t}$ to the forecast $f_{s,t}$ by a linear regression model with Normal errors:
\begin{equation} y_{s,t} = \alpha_s + \beta_s f_{s,t} + \sqrt{e^{\tau_s}} \epsilon_{s,t} \end{equation}
Specifically for our application we have
- $y_{s,t}$ is the observation of the real climate at grid point $s$ at time $t$
- $f_{s,t}$ is the (imperfect) climate forecast for grid point $s$ for time $t$, which is corrected ("post-processed") by linear regression
- $\alpha_s$, $\beta_s$ and $\tau_s$ are local parameters of the regression model, assumed to be constant in time but variable in space
- the residuals $\epsilon_{s,t}$ are iid standard Normal variates
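As a quick illustration of the model (a toy simulation with made-up parameter values, not the real data), observations at a single grid point could be generated like this:
# toy simulation of the regression model at one grid point
# (alpha_s, beta_s, tau_s are made-up illustrative values, not estimates)
set.seed(1)
n_years = 17
alpha_s = 0.1
beta_s = 0.8
tau_s = -0.5
fcst_s = rnorm(n_years)  # standardised forecasts
obs_s = alpha_s + beta_s * fcst_s + sqrt(exp(tau_s)) * rnorm(n_years)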
Our goal is to model the regression parameters $\alpha_s$, $\beta_s$ and $\tau_s$ as smooth spatial fields, rather than estimating them independently at each grid point.
For spatial modelling, we collect the regression parameters in the latent vector $x = (\alpha_1, \dots, \alpha_{N_s}, \beta_1, \dots, \beta_{N_s}, \tau_1, \dots, \tau_{N_s})'$, and denote the hyperparameters (here the smoothing parameter of the spatial model) by $\theta$.
The joint log-likelihood of the observations, given the latent field $x$ and the hyperparameters $\theta$, is
\begin{align} & \log p(y_{1,1},...y_{N_t,{N_s}} \vert x,\theta)\newline & = \sum_{s,t} \log p(y_{s,t} \vert x,\theta)\newline & \propto -\frac{N_t}{2} \sum_s \tau_s - \frac12 \sum_{s=1}^{N_s} e^{-\tau_s} \sum_{t=1}^{N_t} (y_{s,t} - \alpha_s - \beta_s f_{s,t})^2 \end{align}
The latent field $x$ is assigned a Gaussian prior that factorises over $\alpha$, $\beta$ and $\tau$, with precision matrices $Q_\alpha$, $Q_\beta$ and $Q_\tau$:
\begin{align} \log p(x \vert \theta) & = \log p(\alpha \vert \theta) + \log p(\beta \vert \theta) + \log p(\tau \vert \theta)\newline & \propto \frac12 (\log\det Q_\alpha + \log\det Q_\beta + \log\det Q_\tau) - \frac12 \left[ \alpha' Q_\alpha \alpha + \beta' Q_\beta \beta + \tau' Q_\tau \tau \right]\newline & =: \frac12 \log\det Q - \frac12 x'Qx \end{align}
and so the prior on the full latent field $x$ is a zero-mean Gaussian with block-diagonal precision matrix $Q = \mathrm{diag}(Q_\alpha, Q_\beta, Q_\tau)$.
The precision matrix will be specified below, but first we look at the standard method of fitting individual regression models, ignoring spatial dependency.
We fit a linear regression model to each grid point individually without considering any smoothness of the regression parameters over neighboring grid points.
We simply use `lm` to fit linear models at each grid point:
# fit linear models
t2m_lm = t2m %>% nest(-c(s,lat,lon)) %>%
mutate(fit = purrr::map(data, ~ lm(obs ~ fcst, data=.)))
# extract residuals and fitted values
t2m_lm_fit =
t2m_lm %>%
mutate(fit = purrr::map(fit, fortify)) %>%
unnest(fit, data) %>%
select(-c(.sigma, .cooksd, .stdresid)) %>%
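# leave-one-out residuals follow from the hat values as resid / (1 - hat)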
mutate(resid_loo = .resid / (1 - .hat)) %>%
rename(resid = .resid,
fitted=.fitted)
# calculate in-sample and leave-one-out mean squared errors of the regression fits
t2m_lm_mse = t2m_lm_fit %>%
summarise(mse = mean(resid^2),
mse_loo = mean(resid_loo^2))
# output
t2m_lm_mse %>% knitr::kable()
mse | mse_loo |
---|---|
0.6612585 | 0.8995776 |
To model spatial dependency of the regression parameters, we use a 2d random walk model, i.e. the value at grid point $(i,j)$ is Normally distributed around the average of its four nearest neighbours:
\begin{equation} x_{i,j} \sim N\left[\frac14 (x_{i-1,j} + x_{i+1,j} + x_{i,j-1} + x_{i,j+1}), \sigma^2 \right] \end{equation}
This model implies that the vector of grid-point values has a multivariate Normal distribution with a sparse precision matrix $Q$, which we construct below.
TODO: derive matrix
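A sketch of the step flagged by this TODO (my shorthand, not a full derivation): one common way to turn the neighbourhood model into a joint distribution, and the one used by the code below, is to treat the deviation of each grid point from the average of its neighbours as an independent $N(0, \sigma^2)$ increment. Stacking these deviations into the vector $Dx$, where $D$ is the sparse matrix constructed below, gives
\begin{equation} p(x) \propto \exp\left( -\frac{1}{2\sigma^2} \Vert D x \Vert^2 \right) = \exp\left( -\frac12 x' \frac{D'D}{\sigma^2} x \right), \qquad \text{i.e.} \quad Q = \frac{D'D}{\sigma^2}. \end{equation}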
# diagonals of the circular matrix D
diags = list(rep(1,N_s),
rep(-.25, N_s-1),
rep(-.25, (N_j-1)*N_i))
# missing values outside the boundary are set equal to the value on the
# boundary, i.e. values on the edges have 1/4 of themselves subtracted and
# values on the corners have 1/2 of themselves subtracted
edges = sort(unique(c(1:N_i, seq(N_i, N_s, N_i), seq(N_i+1, N_s, N_i), (N_s-N_i+1):N_s)))
corners = c(1, N_i, N_s-N_i+1, N_s)
diags[[1]][edges] = 0.75
diags[[1]][corners] = 0.5
# values on first and last row have zero contribution from i-1 and i+1
diags[[2]][seq(N_i, N_s-1, N_i)] = 0
# construct band matrix D and calculate Q
Dmat = bandSparse(n = N_s, k=c(0,1,N_i), diagonals=diags, symmetric=TRUE)
Q_alpha = Q_beta = Q_tau = crossprod(Dmat)
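As a quick sanity check (illustrative; it assumes the grid-point indexing $s = (j-1) N_i + i$ defined above), the row of $D$ for an interior grid point should have a 1 on the diagonal, $-1/4$ on its four neighbours, and sum to zero:
# inspect the row of D for an interior grid point (i=2, j=2)
s_int = N_i + 2
Dmat[s_int, c(s_int, s_int - 1, s_int + 1, s_int - N_i, s_int + N_i)]
sum(Dmat[s_int, ])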
Below we plot the upper-left $100 \times 100$ block of the precision matrix $Q_\alpha$ to illustrate its sparse banded structure:
image(Q_alpha[1:100, 1:100])
We want to calculate the posterior distribution of all the regression parameters, i.e. of the latent field $x$, given the observations $y$ and the hyperparameters $\theta$. The log of the joint density of $y$ and $x$ is
\begin{align} \log p(y,x) & \propto \log p(y \vert x, \theta) + \log p(x \vert \theta) + \log p(\theta)\newline & \propto -\frac{N_t}{2} \sum_s \tau_s - \frac12 \sum_{s=1}^{N_s} \sum_{t=1}^{N_t} e^{-\tau_s} (y_{s,t} - \alpha_s - \beta_s f_{s,t})^2 + \frac12 \log\det Q - \frac12 x'Qx + \log p(\theta) \end{align}
Note that the precision matrix $Q$ depends on the hyperparameter $\theta$ (the random walk variance $\sigma^2$), which we keep fixed for now. To find the posterior mode of $x$ for fixed $\theta$, we only need the terms of the joint log-density that depend on $x$, which we collect in the function
\begin{equation} f(x) = - \frac12 x'Qx - \frac{N_t}{2} \sum_s \tau_s - \frac12 \sum_{s=1}^{N_s} \sum_{t=1}^{N_t} e^{-\tau_s} (y_{s,t} - \alpha_s - \beta_s f_{s,t})^2 \end{equation}
Since the $\tau_s$ enter through the factors $e^{-\tau_s}$, the maximum of $f(x)$ cannot be found in closed form. We therefore find the mode numerically with a Newton-Raphson method, summarised by the following steps and written out compactly below:
- an initial guess $x_0$ for the mode is proposed
- the function $f(x)$ is Taylor-expanded to second order around $x_0$: $f(x) \approx \tilde{f}(x) = f(x_0) + (x - x_0)^\prime \, grad f(x_0) + \frac12 (x-x_0)^\prime \, Hf(x_0) \, (x-x_0)$, where $grad f$ is the gradient of $f$ and $Hf$ is the Hessian matrix of $f$
- the mode of the second-order approximation $\tilde{f}(x)$ is given by the solution of the linear system $Hf(x_0) (x - x_0) = -grad f(x_0)$ and serves as an improved estimate of the true mode of $f(x)$
- a new Taylor approximation around the improved estimate yields yet another estimate of the mode, so the algorithm can be iterated to convergence to find the mode of $f(x)$, and thus the posterior mode of $x$ given $y$ and $\theta$
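Written compactly, each iteration therefore updates the current estimate according to
\begin{equation} x_{new} = x_0 - \left[ Hf(x_0) \right]^{-1} grad f(x_0) \end{equation}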
Once the estimate of the mode has converged, the curvature of $f(x)$ at the mode yields a Gaussian approximation of the posterior $p(x \vert y, \theta)$, with mean equal to the mode $x_0$ and covariance matrix equal to the negative inverse Hessian at the mode, i.e. the Laplace approximation.
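Spelled out (this is the form from which the posterior standard deviations are read off the Hessian later on):
\begin{equation} p(x \vert y, \theta) \approx N\left( x \,\vert\, x_0, \, -\left[ Hf(x_0) \right]^{-1} \right) \end{equation}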
The gradient of $f(x)$ with respect to the parameter blocks $\alpha$, $\beta$ and $\tau$ is
\begin{align} \frac{\partial f}{\partial \alpha} & = -Q_\alpha \alpha + vec_s\left[ e^{-\tau_s} \sum_t (y_{s,t}-\alpha_s-\beta_s f_{s,t}) \right]\newline \frac{\partial f}{\partial \beta} & = -Q_\beta \beta + vec_s\left[ e^{-\tau_s} \sum_t f_{s,t}(y_{s,t}-\alpha_s-\beta_s f_{s,t}) \right]\newline \frac{\partial f}{\partial \tau} & = -Q_\tau \tau + vec_s\left[ \frac12 e^{-\tau_s} \sum_t (y_{s,t}-\alpha_s-\beta_s f_{s,t})^2 -\frac{N_t}{2} \right] \end{align}
The Hessian of $f(x)$ is
\begin{equation} Hf(x) = \left(\begin{matrix} H_{\alpha\alpha}-Q_\alpha & H_{\alpha\beta} & H_{\alpha\tau}\newline H_{\alpha\beta} & H_{\beta\beta} -Q_\beta& H_{\beta\tau}\newline H_{\alpha\tau} & H_{\beta\tau} & H_{\tau\tau} -Q_\tau \end{matrix}\right) \end{equation}
with diagonal matrices of second derivatives given by
\begin{align} H_{\alpha\alpha} & = diag_s \left[ - N_t e^{-\tau_s} \right] \newline H_{\alpha\beta} & = diag_s \left[ - e^{-\tau_s} \sum_t f_{s,t} \right] \newline H_{\alpha\tau} & = diag_s \left[ -e^{-\tau_s} \sum_t (y_{s,t} - \alpha_s - \beta_s f_{s,t}) \right] \newline H_{\beta\beta} & = diag_s \left[ -e^{-\tau_s} \sum_t f_{s,t}^2 \right] \newline H_{\beta\tau} & = diag_s \left[ -e^{-\tau_s} \sum_t f_{s,t} (y_{s,t} - \alpha_s - \beta_s f_{s,t}) \right] \newline H_{\tau\tau} & = diag_s \left[ -\frac12 e^{-\tau_s} \sum_t (y_{s,t} - \alpha_s - \beta_s f_{s,t})^2 \right] \end{align}
The function `calc_f` below returns a list with the value of $f(x)$, its gradient and its Hessian, all evaluated at a given parameter vector `x`:
calc_f = function(x, sigma2, data) {
# function that returns a list of f, grad f, hess f as function of sigma2 (the
# innovation variance of the 2d random walk)
# data is a data frame with columns: s, year, obs, fcst
N_s = data %>% distinct(s) %>% nrow
N_t = data %>% distinct(year) %>% nrow
alpha = x[1:N_s]
beta = x[1:N_s + N_s]
tau = x[1:N_s + 2*N_s]
df_x = data_frame(s = 1:N_s, alpha=alpha, beta=beta, tau=tau)
df = data %>%
# join alpha, beta, tau to data frame
left_join(df_x, by='s') %>%
# add fitted values and residuals
mutate(yhat = alpha + beta * fcst,
resid = obs - yhat) %>%
# calculate the necessary summary measures for gradient and hessian
group_by(s) %>%
summarise(
tau = tau[1],
exp_m_tau = exp(-tau),
sum_resid = sum(resid),
sum_resid_2 = sum(resid ^ 2),
sum_f_resid = sum(fcst * resid),
sum_f_2 = sum(fcst ^ 2),
sum_f = sum(fcst))
derivs = df %>% mutate(
dgda = exp_m_tau * sum_resid,
dgdb = exp_m_tau * sum_f_resid,
dgdt = -N_t/2 + 0.5 * exp_m_tau * sum_resid_2,
d2gdaa = -N_t * exp_m_tau,
d2gdbb = -exp_m_tau * sum_f_2,
d2gdtt = -0.5 * exp_m_tau * sum_resid_2,
d2gdab = -exp_m_tau * sum_f,
d2gdat = -exp_m_tau * sum_resid,
d2gdbt = -exp_m_tau * sum_f_resid) %>%
arrange(s)
# calculate f(x)
f = with(df, -N_t/2 * sum(tau) - 0.5 * sum(exp_m_tau * sum_resid_2)) -
0.5 * drop(crossprod(alpha, Q_alpha) %*% alpha +
crossprod(beta, Q_beta) %*% beta +
crossprod(tau, Q_tau) %*% tau) / sigma2
# the gradient
grad_f = c(-drop(Q_alpha %*% alpha) / sigma2 + derivs$dgda,
-drop(Q_beta %*% beta) / sigma2 + derivs$dgdb,
-drop(Q_tau %*% tau) / sigma2 + derivs$dgdt)
# the hessian
hess_f =
- bdiag(Q_alpha, Q_beta, Q_tau) / sigma2 +
with(derivs, bandSparse(
n = 3 * N_s,
k = c(0, N_s, 2 * N_s),
diagonals= list(c(d2gdaa, d2gdbb, d2gdtt), c(d2gdab, d2gdbt), d2gdat),
symmetric=TRUE))
return(list(f=f, grad_f=grad_f, hess_f=hess_f))
}
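As an optional check of the implementation (not part of the original analysis), the analytical gradient returned by `calc_f` can be compared with a finite-difference approximation at a few coordinates of $x$:
# optional check: compare analytical and finite-difference gradients of f
set.seed(123)
x_test = rnorm(3 * N_s, sd = 0.1)
f_list = calc_f(x_test, sigma2 = 0.1, data = t2m)
eps = 1e-6
for (k in sample(3 * N_s, 3)) {
  x_pert = x_test
  x_pert[k] = x_pert[k] + eps
  grad_fd = (calc_f(x_pert, sigma2 = 0.1, data = t2m)$f - f_list$f) / eps
  cat(sprintf('coordinate %i: analytical %.5f vs finite difference %.5f\n',
              k, f_list$grad_f[k], grad_fd))
}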
The function `calc_x0` uses the derivatives returned by `calc_f` to iteratively find the mode of $f(x)$ with the Newton-Raphson method outlined above:
calc_x0 = function(sigma2, data, lambda=1, tol=1e-4) {
N_s = data %>% distinct(s) %>% nrow
x0 = rep(0, 3*N_s) # initial guess is x0 = 0
converged = FALSE
while(!converged) {
f_list = calc_f(x0, sigma2, data)
# Levenberg-Marquardt method: increase the diagonal of Hf to improve
# conditioning before calling solve(hess_f, -grad_f)
f_list = within(f_list, {diag(hess_f) = diag(hess_f) * (1 + lambda)})
x = x0 + with(f_list, solve(hess_f, - grad_f))
if (mean((x-x0)^2) < tol) {
converged = TRUE
} else {
x0 = x
}
}
return(list(x0=x, Hf=f_list$hess_f))
}
# estimate the model with smoothing parameter sigma2 = 0.1
x0_list = calc_x0(sigma2=0.1, data=t2m, lambda=1, tol=1e-4)
# extract mode and marginal stdev of the Laplace approximation
x0 = x0_list$x0
x0_sd = with(x0_list, sqrt(diag(solve(-Hf))))
# extract alpha, beta, tau
df_x0 = data_frame(
s = 1:N_s,
alpha_hat = x0[1:N_s] , alpha_hat_sd = x0_sd[1:N_s],
beta_hat = x0[1:N_s + N_s] , beta_hat_sd = x0_sd[1:N_s + N_s],
tau_hat = x0[1:N_s + 2*N_s], tau_hat_sd = x0_sd[1:N_s + 2*N_s]
)
# combine with temperature data frame
df_out = t2m %>%
left_join(df_x0, by='s') %>%
mutate(y_hat = alpha_hat + beta_hat * fcst) %>%
mutate(resid_hat = obs - y_hat) %>%
mutate(resid = obs - fcst)
df_out
## # A tibble: 4,641 x 17
## year lat lon fcst obs i j s
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 45.35 5.62 -0.5723529 -0.65581699 1 1 1
## 2 1 45.35 6.33 -0.5403007 -0.28856209 1 2 14
## 3 1 45.35 7.03 -0.4745686 -0.08392157 1 3 27
## 4 1 45.35 7.73 -0.5454510 -0.12503268 1 4 40
## 5 1 45.35 8.44 -0.6720850 -0.16267974 1 5 53
## 6 1 45.35 9.14 -0.6268562 -0.03320261 1 6 66
## 7 1 45.35 9.84 -0.5265686 0.27477124 1 7 79
## 8 1 45.35 10.55 -0.5040850 0.35745098 1 8 92
## 9 1 45.35 11.25 -0.4779935 0.11647059 1 9 105
## 10 1 45.35 11.95 -0.4878824 -0.06941176 1 10 118
## # ... with 4,631 more rows, and 9 more variables: alpha_hat <dbl>,
## # alpha_hat_sd <dbl>, beta_hat <dbl>, beta_hat_sd <dbl>, tau_hat <dbl>,
## # tau_hat_sd <dbl>, y_hat <dbl>, resid_hat <dbl>, resid <dbl>
We want to compare the two models by the mean squared prediction errors between their post-processed forecasts and the observations. But first we compare the estimated regression parameters.
# extract parameter estimates from spatial model
df_pars_rw = df_out %>% filter(year==1) %>%
select(lat, lon, s, alpha_hat, beta_hat, tau_hat) %>%
rename(alpha = alpha_hat, beta = beta_hat, tau = tau_hat) %>%
mutate(type='rw')
# extract parameter estimates from independent model
df_pars_lm = t2m_lm %>%
mutate(alpha = map_dbl(fit, ~ .$coefficients[1]),
beta = map_dbl(fit, ~ .$coefficients[2]),
tau = map_dbl(fit, ~ log(sum(.$residuals^2)) - log(.$df.residual)),
type='lm')
# combine into one data frame, ignore alpha since it is zero for normalised
# data
df_pars = bind_rows(df_pars_lm, df_pars_rw) %>%
select(-alpha) %>%
gather('parameter', 'estimate', c(beta, tau))
We plot the estimated regression parameters $\beta_s$ and $\tau_s$ for the independently fitted model (`lm`) and the spatially smoothed model (`rw`):
ggplot(df_pars) +
geom_raster(aes(x=lon, y=lat, fill=estimate)) +
facet_grid(type~parameter) +
geom_path(data=borders, aes(x=long, y=lat, group=group), col='white') +
coord_cartesian(xlim=lon_range, ylim=lat_range) +
scale_fill_viridis()
The estimation and smoothing of the regression parameters seems to work.
The estimated parameters are similar, but the fields on the bottom are smoother because we have built spatial dependency into the model.
By varying the parameter `sigma2` in the function `calc_x0`, higher or lower degrees of smoothing can be achieved.
TODO: illustrate effect of different smoothing parameters
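A minimal sketch of how this could be done (illustrative only; the chosen values of `sigma2` are arbitrary and the refits are not run here):
# refit the spatial model for a few smoothing parameters and collect the
# estimated beta fields; joining to lat/lon by s then allows mapping them
# side by side with facet_wrap(~ sigma2)
sigma2_values = c(0.01, 0.1, 1)
df_beta_sigma2 = purrr::map_dfr(sigma2_values, function(s2) {
  x0 = calc_x0(sigma2 = s2, data = t2m, lambda = 1, tol = 1e-4)$x0
  data_frame(s = 1:N_s, sigma2 = s2, beta_hat = x0[1:N_s + N_s])
})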
The important question is now whether or not the spatial modeling has improved the transformed forecasts over independent estimation. First of all we look at the mean squared residuals of the two models, i.e. the mean squared prediction errors (MSEs) calculated on the training data:
df_mse_insample_lm = t2m_lm_mse %>% select(-mse_loo) %>% mutate(type='lm')
df_mse_insample_rw = df_out %>% group_by(s,lat,lon) %>% summarise(mse = mean(resid_hat^2)) %>% mutate(type='rw')
df_mse_insample = bind_rows(df_mse_insample_lm, df_mse_insample_rw)
df_mse_insample %>% group_by(type) %>% summarise(mse = mean(mse)) %>% knitr::kable()
type | mse |
---|---|
lm | 0.6612585 |
rw | 0.6783852 |
It is not surprising that the in-sample residuals are smaller for the independent model than for the spatially dependent model, since the sum of squared residuals is exactly the quantity that is minimised by standard linear regression.
Any modification will necessarily increase the sum of squared residuals.
To evaluate the models fairly, we have to look at their performance out-of-sample, i.e. how well they predict observations that were not used for fitting. We therefore perform a leave-one-year-out cross validation, refitting the spatial model with each year held out in turn:
df_cv = t2m %>% mutate(y_hat = NA_real_)
for (yr_loo in unlist(distinct(t2m, year))) {
# create training data set
df_trai = t2m %>% filter(year != yr_loo)
# estimate regression parameters
x0 = calc_x0(sigma2=0.1, data=df_trai, lambda=1, tol=1e-4)$x0
alpha_hat = x0[1:N_s]
beta_hat = x0[1:N_s + N_s]
tau_hat = x0[1:N_s + 2*N_s]
# append fitted values for the held-out year; index the parameter vectors
# by the grid point id s so that parameters and rows stay aligned
df_cv = df_cv %>%
  mutate(y_hat = ifelse(year == yr_loo, alpha_hat[s] + beta_hat[s] * fcst, y_hat))
}
df_mse_outofsample_lm = t2m_lm_fit %>% select(s, lat, lon, year, resid_loo) %>% mutate(type='lm')
df_mse_outofsample_rw = df_cv %>% mutate(resid_loo = y_hat - obs) %>% select(s, lat, lon, year, resid_loo) %>% mutate(type='rw')
df_mse_outofsample =
bind_rows(df_mse_outofsample_lm, df_mse_outofsample_rw) %>%
group_by(type) %>% summarise(mse = mean(resid_loo^2))
df_mse_outofsample %>% knitr::kable()
type | mse |
---|---|
lm | 0.8995776 |
rw | 0.8524261 |
When evaluated out-of-sample in a leave-one-out cross validation, the regression model with spatially smoothed parameters performs better than the regression model with independently fitted parameters.
The improvement in mean squared error is close to 5%, which is encouraging.
Since we have not yet optimised (or integrated over) the smoothing parameter $\sigma^2$, there is probably room for further improvement.
TODO:

- compare posterior standard deviations of the spatial model with standard errors of regression parameters
- explore $\sigma^2$ to estimate the posterior $p(\theta \vert y)$
- integrate over $\sigma^2$ to estimate $p(x \vert y)$
- consider different $\sigma^2$ for $\alpha$, $\beta$, and $\tau$
- consider different spatial models