# Build your own forecast {#sec-project1}
This project challenges you to build a new model that forecasts NEON water temperature using the foundations taught thus far in the book. You will automatically submit forecasts to the NEON Ecological Forecasting Challenge.
Here is information about generating and submitting forecasts: <https://projects.ecoforecast.org/neon4cast-ci/>
You are welcome to use any modeling or statistical framework. Below, I provide some great resources for R packages that open the doors to many modeling frameworks, but you are not required to use them.
## Expectations for the project
1. Code in a GitHub repository that generates a forecast using a model that is "substantially" different from the linear model that we developed together. "Substantially" means you used a different modeling approach or a linear model with different variables. In the latter case, you need to justify why you chose those variables and why you think the result is a better model than the one we developed together. Merely adding one more covariate because it is easy to do does not represent a "substantially" different model.
2. The code in #1 generates real-time forecasts that include uncertainty.
3. The forecast model is registered with the NEON Ecological Forecasting Challenge. Information for registering your model and submitting it can be found [here](https://projects.ecoforecast.org/neon4cast-ci/instructions.html).
4. Automated forecasts are generated using GitHub Actions.
## Helpful tips
You may want to use lags in your model. Lags in the target variables are used in autoregressive models. Lags in the inputs may improve your predictions because past weather may be a better predictor than current weather. Considerations for working with lags are below.
### Lags in targets
One issue to consider when working with autoregressive (lagged) models is that data may not be available on the date you start your forecast (e.g., today). In this case, the forecast needs to start at the most recent observation and run to the present day, and the model then continues into the future over the full horizon. Only the future datetimes are submitted as a true forecast. Effectively, the output of the model run up to the present day is the starting point for the true forecast into the future.
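As a minimal sketch (assuming a `targets` data frame with `datetime` and `observation` columns, as used later in this chapter), the full model horizon can be computed as:

```{r}
#| eval: false
# date of the most recent non-NA observation
last_obs <- max(targets$datetime[!is.na(targets$observation)])

# the model must run from the last observation through today, plus the
# desired 35-day forecast horizon
full_horizon <- as.numeric(Sys.Date() - last_obs) + 35
```

The Fable random-walk example at the end of this chapter calculates this extended horizon in the same way.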
### Lags in inputs
You may want to use linear modeling (like `lm`) but add more covariates. If you want to explore the use of lags or running sums in the weather inputs, you will need to combine the historical and future weather data together (otherwise you won't have a value for the lagged variable on the first day of the forecast). Here is some code to create a combined weather dataset.
```{r}
#| message: false
library(tidyverse)
library(lubridate)
```
```{r}
noaa_forecast_start <- Sys.Date() - days(1)
min_historical_day <- Sys.Date() - days(60)
variables <- "air_temperature"
sites <- "BARC"
noaa_past_mean <- neon4cast::noaa_stage3() |>
  filter(site_id %in% sites,
         variable %in% variables,
         datetime > min_historical_day,
         datetime < noaa_forecast_start) |>
  collect() |>
  mutate(datetime = as_date(datetime)) |>
  summarize(prediction = mean(prediction), .by = c("datetime", "site_id", "parameter", "variable")) |>
  pivot_wider(names_from = variable, values_from = prediction) |>
  mutate(air_temperature = air_temperature - 273.15)

noaa_future_mean <- neon4cast::noaa_stage2(start_date = noaa_forecast_start) |>
  filter(datetime >= noaa_forecast_start,
         site_id %in% sites,
         variable %in% variables) |>
  collect() |>
  mutate(datetime = as_date(datetime)) |>
  summarize(prediction = mean(prediction), .by = c("datetime", "site_id", "parameter", "variable")) |>
  pivot_wider(names_from = variable, values_from = prediction) |>
  mutate(air_temperature = air_temperature - 273.15) |>
  select(datetime, site_id, air_temperature, parameter)

combined_met_df <- bind_rows(noaa_past_mean, noaa_future_mean)
```
```{r}
ggplot(combined_met_df, aes(x = datetime, y = air_temperature, group = parameter)) +
  geom_line() +
  geom_vline(aes(xintercept = Sys.Date()), color = "blue")
```
Remember that the lag needs to be calculated **within an ensemble member**.
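For example, here is a minimal sketch of adding a one-day lag of air temperature to the combined weather dataset (the `lagged_air_temperature` column name is just illustrative). Grouping by `parameter` (and `site_id`, if you have multiple sites) ensures the lag never crosses ensemble members:

```{r}
combined_met_df_lagged <- combined_met_df |>
  arrange(site_id, parameter, datetime) |>
  group_by(site_id, parameter) |>
  # the lag is computed within each ensemble member (parameter)
  mutate(lagged_air_temperature = lag(air_temperature, n = 1)) |>
  ungroup()
```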
### Reforecasting
A good way to evaluate your forecast model is to generate forecasts for a starting date (e.g., `reference_datetime`) in the past. These "reforecasts" or "retroactive forecasts" use NOAA forecasts (or other forecasts) as inputs just like real-time forecasts do, so they mimic a genuine forecast of the future. Importantly, you should only use target data from before the reforecast start date to train your forecast model. The key difference is that you can immediately compare the results to observations.
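A minimal sketch of setting up a reforecast (the date is illustrative, and this assumes the `targets` data frame used later in this chapter):

```{r}
#| eval: false
reference_datetime <- as_date("2023-07-01")

# only train on observations from before the reforecast start date
targets_train <- targets |>
  filter(datetime < reference_datetime)

# use the NOAA forecast issued on the reforecast start date as the driver
noaa_reforecast <- neon4cast::noaa_stage2(start_date = reference_datetime)
```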
## Intro to Tidymodels
Tidymodels is a meta-package that provides a standardized interface to many machine-learning algorithms. This brief tutorial provides an example application of tidymodels and builds on the machine learning module in the FREC 3044: Environmental Data Science course at Virginia Tech: <https://github.com/frec-3044/machine-learning-template>.
Tidymodels does not estimate uncertainty, so forecast uncertainty needs to be generated outside the tidymodels framework. For example, driver uncertainty can be generated by creating a tidymodels forecast for each meteorological ensemble member.
### Overview of tidymodels steps
**Step 1: Obtain data**
Read data into memory. In the course, we have accessed CSV, Excel, and database formats located in the GitHub repo, on websites, and through an API.
**Step 2: Pre-process data**
*Split data: divide data into the training and test sets.*
- You will use the training set to fit your model and the testing set to evaluate the model performance.
- We only want to evaluate our model using data that the model has not used yet. Otherwise, we can only say that the model is good at fitting the data that was used to train it. This does not help us understand how it would perform as a tool to generate new predictions.
- You want most of your data in the training set because machine learning models can be quite "hungry for data".
- A typical split is 80% training and 20% testing but there isn't a required split.
- Data are randomly assigned to the two splits.
- Your data may have obvious groupings that you want to be equally represented in both the training and testing sets. For example, if you have cats and dogs as a variable you are using to predict tail length, you may want to be sure that your training set is not randomly full of dogs because that means it may not predict the cats well in the testing set. You can prevent this by defining the `strata` used when assigning the training and testing sets.
*Recipes: Modify predictors/features/covariates/independent variables for analysis*
- Modify the data set so that variables are correctly formatted for a particular model. For example, a linear regression requires groups to be dummy variables. Therefore, a column called "ecosystem" with two values ("forest" and "grass") would be converted to two columns: ecosystem_forest with a value of 0 or 1, and ecosystem_grass with a value of 0 or 1.
- Removing highly correlated predictors
- Rescaling predictors (e.g., converting to 0 mean and 1 standard deviation)
- Transforming predictors (e.g., log); a sketch of a recipe combining these kinds of steps follows this list
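As a hedged sketch, a recipe combining these steps might look like the following (the dataset and column names `example_data`, `tail_length`, `ecosystem`, and `body_mass` are hypothetical):

```{r}
#| eval: false
example_recipe <- recipe(tail_length ~ ., data = example_data) |>
  # convert the grouping variable to dummy (0/1) columns
  step_dummy(ecosystem) |>
  # remove highly correlated predictors
  step_corr(all_numeric_predictors()) |>
  # rescale predictors to mean 0 and standard deviation 1
  step_normalize(all_numeric_predictors()) |>
  # log-transform a skewed predictor
  step_log(body_mass)
```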
**Step 3: Specify model and workflow**
A workflow combines the model and recipe in a single "object" that you can use for training, testing, and predicting new data.
- Define model type: linear regression, regression tree, neural net, etc.
- Define model engine: particular R package (`lm`, `ranger`, etc.)
- Define model mode: regression or classification
- Define workflow: combine recipe and model definition
**Step 4: Train model**
- Tune hyper-parameters: hyper-parameters are configuration settings that govern how a particular ML method is fit to data. They are called "hyper" because "regular" parameters are the parameters within the model that are learned by the ML method. For example, a method called "random forest" requires a hyper-parameter that controls the minimum size of a regression tree that is allowed. This parameter (called `min_n`) could be directly provided by you or could be tuned. The tuning process involves repeatedly fitting the model using different values of the hyper-parameter and keeping the hyper-parameter values that yield the best fit to the data. Importantly: not all ML methods have hyper-parameters (e.g., linear regression using the `lm` engine does not have hyper-parameters). We don't tune hyperparameters in this example. See [`example-with-tuning.Rmd`](https://github.com/frec3044/machine-learning/blob/main/assignment/example-with-tuning.Rmd) for an extension of this application that includes hyperparameter tuning.
- Fit model (using the best hyper-parameters if they are tuned). The model is fit to the training data.
**Step 5: Predict**
- Predict the testing data using the model that was fit to the training data. This step is critical because we only want to evaluate our model using data that the model has not used yet. Otherwise, we can only say that the model is good at fitting the data that was used to train it. This does not help us understand how it would perform as a tool to generate new predictions.
- Predictions are quite easy using the `predict()` function.
**Step 6: Evaluate model**
- It is important to use the appropriate metrics to evaluate how the model performs. Some metrics only apply to classification problems and others only apply to regression problems. A list of metric types can be found [here](https://yardstick.tidymodels.org/articles/metric-types.html)
- We will be focusing on root-mean-squared error (`rmse`), a metric that takes the difference between each prediction and its observation, squares it, averages the squared errors over all data points, and then takes the square root of that mean squared error.
- R-squared (`rsq`) is another metric that we used in the lake ice module.
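For reference, `rmse` for observations $y_i$ and predictions $\hat{y}_i$ over $n$ data points is:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}
$$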
**Step 7: Deploy model**
- One of the main points of using machine learning is to develop a tool that can be used with new predictors to predict data that weren't involved in the training and testing. These are data that have all the columns necessary to make predictions but lack data in the column you are trying to predict.
- This step is simple because it involves using the `predict()` function with the same trained model but with new data.
### Application: Predicting water temperature at NEON sites
```{r message=FALSE}
library(tidymodels)
tidymodels_prefer()
set.seed(100) # for reproducible random number generation
```
#### Step 1: Obtain data
```{r}
lake_sites <- c("BARC", "SUGG")
```
```{r}
targets <- read_csv('https://data.ecoforecast.org/neon4cast-targets/aquatics/aquatics-targets.csv.gz',
                    show_col_types = FALSE) |>
  filter(site_id %in% lake_sites)
```
```{r}
targets <- targets |>
  filter(site_id %in% lake_sites,
         variable == "temperature")
```
```{r}
# past stacked weather
df_past <- neon4cast::noaa_stage3()
variables <- c("air_temperature")
noaa_past <- df_past |>
  dplyr::filter(site_id %in% lake_sites,
                datetime >= ymd('2017-01-01'),
                variable %in% variables) |>
  dplyr::collect()
```
```{r}
# aggregate the past to mean values
noaa_past_mean <- noaa_past |>
  mutate(datetime = as_date(datetime)) |>
  group_by(datetime, site_id, variable) |>
  summarize(prediction = mean(prediction, na.rm = TRUE), .groups = "drop") |>
  pivot_wider(names_from = variable, values_from = prediction) |>
  # convert air temp to C
  mutate(air_temperature = air_temperature - 273.15)
```
```{r}
forecast_date <- Sys.Date()
noaa_date <- forecast_date - lubridate::days(2)
df_future <- neon4cast::noaa_stage2(start_date = noaa_date)
variables <- c("air_temperature")
noaa_future <- df_future |>
  dplyr::filter(reference_datetime == noaa_date,
                datetime >= forecast_date,
                site_id %in% lake_sites,
                variable %in% variables) |>
  dplyr::collect()
```
```{r}
noaa_future_daily <- noaa_future |>
  mutate(datetime = as_date(datetime)) |>
  # mean daily forecasts at each site per ensemble
  summarize(prediction = mean(prediction), .by = c("datetime", "site_id", "parameter", "variable")) |>
  pivot_wider(names_from = variable, values_from = prediction) |>
  # convert to Celsius
  mutate(air_temperature = air_temperature - 273.15) |>
  select(datetime, site_id, air_temperature, parameter)
```
```{r}
targets_df <- targets |>
  filter(variable == 'temperature') |>
  pivot_wider(names_from = 'variable', values_from = 'observation') |>
  left_join(noaa_past_mean,
            by = c("datetime", "site_id")) |>
  mutate(doy = yday(datetime))
```
#### Step 2: Pre-process data
*Split data into training/testing sets*
We are going to split the data into training and testing sets using the `initial_split` function. `prop = 0.80` says to use 80% of the data in the training set, and `strata = site_id` ensures that both sites are represented in the training and testing sets.
```{r}
split <- initial_split(targets_df, prop = 0.80, strata = site_id)
```
Our split should reflect the 80/20 split that we defined using `prop`.
```{r}
split
```
To get the training and testing data we need to apply the `training()` and `testing()` functions to the split.
```{r}
train_data <- training(split)
test_data <- testing(split)
```
You can see that `train_data` is a data frame that we can work with.
```{r}
train_data
```
*Feature engineering using a recipe*
- It requires starting with a dataset that provides the columns.
- It takes a formula with the dependent variable and the predictors. If `.` is used as the predictor, all columns other than the dependent variable are used.
- It includes steps that modify the data.
We will use the following steps:
- `step_rm` because we don't want to use the datetime in the fit.
- `step_naomit` because there are `NA` values in the temperature and air_temperature columns. This is used here for illustrative purposes; it could also be handled by filtering the target data before it is split into training and testing groups.
[The full list of available recipe steps is here](https://recipes.tidymodels.org/reference/index.html), including steps to [filter rows and select columns](https://recipes.tidymodels.org/reference/index.html#step-functions-filters).
Here is our recipe:
```{r}
our_recipe <- train_data |>
  recipe(temperature ~ .) |>
  step_rm(datetime) |>
  step_naomit(air_temperature, temperature)
```
The recipe should show the steps that will be performed when applying the recipe. Importantly, these steps have not yet been applied; we just have a recipe of what to do.
```{r}
our_recipe
```
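If you want to preview the data that the recipe will produce, one option (a sketch; not required for the workflow below) is to estimate the recipe with `prep()` and extract the processed training data with `bake()`:

```{r}
#| eval: false
our_recipe |>
  prep(training = train_data) |>
  bake(new_data = NULL)
```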
#### Step 3: Specify model, engine, and workflow
We need to create the model and engine that we will use. In this example, we are using `linear_reg` with the mode of `regression` (as opposed to `classification`). Setting the mode for a linear regression is not necessary because it only allows regression, but it is included here for completeness.
The engine is `lm` because we are using the standard R function `lm`. There are a ton of other functions for linear regression modeling that we could use. They would be specified as a different engine.
```{r}
our_model <- linear_reg(mode = "regression") |>
  set_engine("lm")
```
You will see the model, mode, and engine in the model object.
```{r}
our_model
```
We now combine the model and the recipe to make a workflow that can be used to fit the training and testing data. `workflow()` initiates the workflow, and `add_model` and `add_recipe` add those components to the workflow. Importantly, the workflow has not yet been applied; we just have a description of what to do.
```{r}
wflow <-
  workflow() |>
  add_model(our_model) |>
  add_recipe(our_recipe)
```
You can see that the workflow object has all the components together.
```{r}
wflow
```
#### Step 4: Train model on training data
We will use the workflow object to train the model. We provide the workflow object and the dataset to the `fit()` function to fit (i.e., train) the model.
```{r}
fit <- wflow |>
  fit(data = train_data)
```
You can see that the fit object is the workflow object plus the results of the model fitting.
```{r}
fit
```
#### Step 5: Predict test data
Now we will predict the testing data using the model that was fit to the training data.
```{r}
predictions <- predict(fit, new_data = test_data)
```
The predictions are a single column called `.pred`.
```{r}
predictions
```
We need to combine the `.pred` column with the testing data using the `bind_cols` function.
```{r}
pred_test <- bind_cols(test_data, predictions)
```
Now we have a data frame with the prediction and all the predictors used to predict it.
```{r}
pred_test
```
#### Step 6: Evaluate model
We will evaluate the performance of our predictions of the testing data using two metrics (`rmse` and `rsq`). The function `metric_set` defines the set of metrics we will be using. It creates a function (here called `multi_metric()`) that we will use to calculate the metrics. We pipe in the predicted test data (`pred_test`) and tell the function that our truth (i.e., observed data) is the `temperature` column and that our predictions (i.e., `estimate`) are in the `.pred` column.
```{r}
multi_metric <- metric_set(rmse, rsq)

metric_table <- pred_test |>
  multi_metric(truth = temperature, estimate = .pred)
```
The resulting table has the metrics for evaluation.
```{r}
metric_table
```
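A quick visual check (a sketch, not part of the required workflow) is to plot predictions against observations with a 1:1 line:

```{r}
ggplot(pred_test, aes(x = temperature, y = .pred)) +
  geom_point() +
  # points on the 1:1 line are perfect predictions
  geom_abline(slope = 1, intercept = 0, color = "blue") +
  labs(x = "observed water temperature (°C)",
       y = "predicted water temperature (°C)")
```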
#### Step 7: Deploy model
The final step is to apply your model to predict new data.
Now read in the new data.
```{r}
targets_future <- noaa_future_daily |>
  mutate(temperature = NA,
         doy = yday(datetime)) |>
  filter(parameter == 1) |>
  select(-parameter)
```
You will notice that `temperature` is all `NA` because you don't yet know what the water temperature will be on the forecasted days.
```{r}
targets_future
```
As in "Step 5: Predict Test Data", use the model fitting on the training data (`fit`) to predict the new data.
```{r}
new_predictions <- predict(fit, new_data = targets_future)
```
Using `parameter == 1` above predicts with only a single NOAA ensemble member, so it ignores driver uncertainty. To propagate the uncertainty in the weather forecast, repeat the prediction for each ensemble member and combine the results:

```{r}
targets_future <- noaa_future_daily |>
  mutate(temperature = NA,
         doy = yday(datetime))

tidymodels_forecast <- data.frame()

for(i in unique(targets_future$parameter)){
  curr_ens <- targets_future |>
    filter(parameter == i) |>
    select(-parameter)

  new_predictions <- predict(fit, new_data = curr_ens)

  curr_ens <- bind_cols(curr_ens, new_predictions) |>
    mutate(parameter = i)

  tidymodels_forecast <- bind_rows(tidymodels_forecast, curr_ens)
}
```
```{r}
tidymodels_forecasts_EFI <- tidymodels_forecast |>
  rename(prediction = .pred) |>
  mutate(variable = "temperature") |>
  # For the EFI challenge we only want the forecast for the future
  filter(datetime > Sys.Date()) |>
  group_by(site_id, variable) |>
  mutate(reference_datetime = min(datetime) - lubridate::days(1),
         family = "ensemble",
         model_id = "tidymodels_lm") |>
  select(model_id, datetime, reference_datetime, site_id, family, parameter, variable, prediction)
```
```{r}
tidymodels_forecasts_EFI |>
  filter(variable == "temperature") |>
  ggplot(aes(x = datetime, y = prediction, group = parameter)) +
  geom_line() +
  facet_wrap(~site_id)
```
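Once the forecast is in the standard format, it can be written to a file and submitted to the Challenge. A minimal sketch, assuming the helper functions in the `neon4cast` package (check the Challenge [instructions](https://projects.ecoforecast.org/neon4cast-ci/instructions.html) for the current file-naming and submission requirements):

```{r}
#| eval: false
forecast_file <- paste0("aquatics-", Sys.Date(), "-tidymodels_lm.csv.gz")
write_csv(tidymodels_forecasts_EFI, forecast_file)

# check that the file passes the Challenge's format requirements, then submit
neon4cast::forecast_output_validator(forecast_file)
neon4cast::submit(forecast_file = forecast_file)
```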
### Tidymodel extras
#### Changing the model approach
You can easily change the modeling approach by changing the model and engine. For example, the following code replaces the linear regression model with a random forest model from the ranger package. All other code used to fit and predict with the model stays the same.
```{r}
our_model <- rand_forest(mode = "regression") |>
  set_engine("ranger")
```
A list of models that are available for you to use can be found [here](https://parsnip.tidymodels.org/reference/index.html#models). Not all models are designed for all applications. For example, forecasting water temperature requires regression models rather than classification models.
#### Tuning hyper-parameters
Some modeling approaches have parameters that govern how the model fitting is done. These are called hyperparameters, and they are different from the parameters that are optimized using data in the model fitting step. Tidymodels provides tools to tune hyperparameters as well. An example can be found [here](https://github.com/frec3044/machine-learning/blob/main/assignment/example-with-tuning.Rmd), and a sketch is shown below.
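As a hedged sketch of what tuning might look like for the random forest above (reusing `our_recipe` and `train_data` from the application; the fold count and grid size are arbitrary choices):

```{r}
#| eval: false
# flag min_n as a hyper-parameter to be tuned rather than fixed
rf_model <- rand_forest(mode = "regression", min_n = tune()) |>
  set_engine("ranger")

rf_wflow <- workflow() |>
  add_model(rf_model) |>
  add_recipe(our_recipe)

# evaluate candidate values of min_n with 5-fold cross-validation
folds <- vfold_cv(train_data, v = 5)
rf_tuned <- tune_grid(rf_wflow, resamples = folds, grid = 10)

# refit on the full training set using the best min_n
best_hp <- select_best(rf_tuned, metric = "rmse")
rf_fit <- finalize_workflow(rf_wflow, best_hp) |>
  fit(data = train_data)
```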
## Intro to Fable Models
The Fable modeling framework provides an interface to many common models used in forecasting. The useful thing about Fable is that it is specifically focused on forecasting, so uncertainty estimation is a core feature. Fable also has an associated ebook called ["Forecasting: Principles and Practice"](https://otexts.com/fpp3/) that is an excellent resource for learning empirical methods in time-series forecasting. Similar to Tidymodels, once you have learned the general structure of how the package works, it opens up many different modeling approaches.
The example below forecasts using a simple random walk. Since the random walk starts at the last observation, it shows how to generate a forecast for variables that have a gap between the last observation and the current day.
```{r}
#| message: false
library(tsibble)
library(fable)
```
### Step 1: Set up targets
```{r}
targets <- read_csv('https://data.ecoforecast.org/neon4cast-targets/aquatics/aquatics-targets.csv.gz',
                    show_col_types = FALSE)
```
```{r}
max_horizon <- 35
var <- "temperature"
site <- "BARC"
```
We want to determine when to start our forecast. If we have observations from today, then our forecast starts today. But if there is a gap between our last observation and today, our forecast needs to start on the date of the last observation, run until today, and then extend into the future for our desired horizon. As a result, the horizon that we provide to the `fable` model may be longer than the horizon that we want to forecast into the future. The code below calculates the start date of the forecast and the full horizon (the desired horizon in the future plus the gap between the last observation and today).
```{r}
forecast_starts <- targets |>
  dplyr::filter(!is.na(observation) & site_id == site & variable == var) |>
  # Start the day after the most recent non-NA value
  dplyr::summarise(start_date = max(datetime) + lubridate::days(1)) |>
  # Horizon = desired horizon + gap between the last observation and today
  dplyr::mutate(h = (Sys.Date() - start_date) + max_horizon,
                h = as.numeric(h))

forecast_starts
```
The targets have to be converted into a time-series tibble (called a `tsibble`). To make a `tsibble`, you need to define the grouping variables (`key`) and the column that defines the time series (`index`). Many modeling approaches require the datetime to be continuous (no gaps) but allow `NA` values within the gaps. The `fill_gaps()` function fills the gaps with `NA`s.
```{r}
targets_use <- targets |>
  dplyr::filter(site_id == site,
                variable == var) |>
  tsibble::as_tsibble(key = c('variable', 'site_id'), index = 'datetime') |>
  tsibble::fill_gaps() # add NA values up to today (index)
```
Once you have a `tsibble`, you combine it with a modeling approach to generate a model that is ready to be used. In this case, the model is a random walk (`RW`) with the target of `observation`.
```{r}
RW_model <- targets_use |>
  fabletools::model(RW = fable::RW(observation))
```
The model is then used to generate a forecast for `h` time steps into the future. The uncertainty is calculated using a bootstrap method (`bootstrap = TRUE`) with 200 ensemble members, i.e., samples (`times = 200`).
```{r}
forecast <- RW_model |>
  fabletools::generate(h = forecast_starts$h, bootstrap = TRUE, times = 200)
```
The forecast is converted into a standard format:
```{r}
RW_forecasts_EFI <- forecast |>
  rename(parameter = .rep,
         prediction = .sim) |>
  # For the EFI challenge we only want the forecast for the future
  # filter(datetime > Sys.Date()) |>
  group_by(site_id, variable) |>
  mutate(reference_datetime = Sys.Date(),
         family = "ensemble",
         model_id = "persistenceRW") |>
  select(model_id, datetime, reference_datetime, site_id, family, parameter, variable, prediction)
```
and plotted. The `reference_datetime` is shown in blue.
```{r}
RW_forecasts_EFI |>
  filter(variable == "temperature") |>
  ggplot(aes(x = datetime, y = prediction, group = parameter)) +
  geom_line() +
  geom_vline(aes(xintercept = reference_datetime), color = "blue") +
  facet_wrap(~site_id)
```
### Fable extras
#### Other models
Other models can be added at the step where the targets are combined with the modeling approach. This example is a [time-series linear model](https://otexts.com/fpp3/regression-intro.html) with a trend and a seasonal term.
```{r}
#| eval: false
TSLM_model <- targets_use |>
  fabletools::model(lm = fable::TSLM(observation ~ trend() + season()))
```
A list of the other available models is here: <https://fable.tidyverts.org/reference/index.html>
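For example, a sketch fitting two other common time-series models (an ARIMA model and exponential smoothing) to the same `tsibble`:

```{r}
#| eval: false
other_models <- targets_use |>
  fabletools::model(arima = fable::ARIMA(observation),
                    ets = fable::ETS(observation))
```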