---
title: "Time Series Analysis Module"
author: "Institutul National de Sanatate Publica (INSP), Bucharest, Romania"
geometry: margin=2cm
output:
# pdf_document:
# dev: cairo_pdf
# fig_width: 12
# highlight: haddock
# includes:
# in_header: preamble_latex.tex
# keep_tex: yes
# latex_engine: pdflatex
# toc: yes
# html_document:
# dev: svg
# fig_width: 12
# toc: yes
word_document:
fig_width: 12
highlight: haddock
toc: yes
linkcolor: blue
linestretch: 1.15
fontsize: 12pt
---
```{r setup, include=FALSE}
rm(list=ls())
library(knitr)
library(memisc)
library(pander) # Use GitHub version 0.6.0 addressing memisc issue
# devtools::install_github('Rapporter/pander')
library(rmdformats)
opts_chunk$set(
echo = TRUE,
# eval = FALSE, # Uncomment to remove R output
# cache = TRUE, # Uncomment after changes to code
prompt = FALSE,
warning = FALSE,
message = FALSE)
opts_knit$set(width=80)
options(max.print="80")
options(scipen=6)
```
> Version 6.0, November 2015
## Disclaimer
The information presented in this exercise and the
associated data files have been deliberately changed so as to facilitate
the acquisition of the expected learning outcomes for fellows of EPIET
and EPIET-associated programmes of cohort 20 (cohort 2014).
This case study was first introduced in 2011 and has been adapted ever since
(see the Copyright and Licence agreement for more detailed information).
# Copyright and License
## Source
This case study was first designed by Alicia Barrasa Blanco and Ioannis
Karagiannis in 2011 for the training needs of EPIET, PAE, NorFETP and
Austrian FETP fellows of Cohort 16 (2010).
## Revisions
November 2012: No major changes.
November 2013: Removal of the Legionnaire's disease in the Netherlands
case study; removal of the acute respiratory infections in Peru exercise
and substitution with the mortality in Spain one; substitution of the
pergram with the epergram command.
November 2014: Major stylistic changes throughout; removal of the
Puumala virus example; inclusion of information reminders in bubbles
throughout; *Salmonella* exercise made optional; adjustment of learning
objectives; removal of the prediction, and forecasting parts; loops in
Stata made optional.
April 2015: Adjusted for delivery in a three-day module, including an
adaptation of the expected learning outcomes; removal of the measles
example; minor stylistic changes throughout; removal of the manual way
of computing moving averages; removal of parts 2 and 3 under outbreak
detection; removal of the loops in Stata.
November 2015: Re-adaptation to a five-day module; minor stylistic
changes throughout; major adaptation of the expected learning outcomes
to strengthen the link with routine work in and evaluation of
surveillance systems; introduction of the isoweek command in Stata;
reintroduction of forecasting and prediction; introduction of ARIMA
models, evaluation of interventions using surveillance data,
capture-recapture studies in evaluating a surveillance system's
sensitivity.
**October 2016: Adaptation to R; minor changes throughout. ARIMA models,
capture-recapture studies left out.**
You are free:
- **to Share**: to copy and distribute the work
- **to Remix**: to adapt and build upon the material
Under the following conditions:
- **Attribution** You must attribute the work in the manner
specified by the author or licensor (but not in any way that suggests
that they endorse you or your use of the work). The best way to do this
is to keep the list of contributors (sources, authors and reviewers) as
it is.
- **Share Alike** If you alter, transform, or build upon this
work, you may distribute the resulting work only under the same or
similar license to this one. Your changes must be documented. Under that
condition, you are allowed to add your name to the list of contributors.
- **Notification** If you use the work in the manner specified
by the author or licensor, notify Ioannis Karagiannis
([[email protected]](mailto:[email protected]))
and Alicia Barrasa Blanco ([[email protected]](mailto:[email protected])).
You cannot sell this work alone but you can use it as part of
teaching with the understanding that:
- **Waiver** Any of the above conditions can be
[**waived**](http://creativecommons.org/licenses/by-sa/3.0/) if you get
permission from the copyright holder.
- **Public Domain** Where the work or any of its elements is in
the [**public domain**](http://wiki.creativecommons.org/Public_domain)
under applicable law, that status is in no way affected by the license.
- **Other Rights** In no way are any of the following rights
affected by the license:
- Your fair dealing or [**fair use**](http://wiki.creativecommons.org/Frequently_Asked_Questions#Do_Creative_Commons_licenses_affect_fair_use.2C_fair_dealing_or_other_exceptions_to_copyright.3F) rights, or other applicable copyright exceptions and limitations;
- The author's [**moral**](http://wiki.creativecommons.org/Frequently_Asked_Questions#I_don.E2.80.99t_like_the_way_a_person_has_used_my_work_in_a_derivative_work_or_included_it_in_a_collective_work.3B_what_can_I_do.3F)
rights;
- Rights other persons may have either in the work itself or in how the work is used, such as [**publicity**](http://wiki.creativecommons.org/Frequently_Asked_Questions#When_are_publicity_rights_relevant.3F) or privacy rights.
- **Notice** For any reuse or distribution, you must make clear
to others the license terms of this work by keeping together this work
and the current license.
This licence is based on [**http://creativecommons.org/licenses/by-sa/3.0/**](http://creativecommons.org/licenses/by-sa/3.0/).
# Guide to the exercises
The case study is designed for use with R version 3.3.0 or later.
## Nomenclature
| Formatting | Meaning |
|-------------------|-----------------------------|
| **mortality.dta** | Name of data set to be used |
| `cases` | Variable name |
| `##` | Indicates lines of R output |
Comments on analytical code are shown as bullet points following the code (following the output if the output is shown).
## Prerequisites
Participants are expected to have experience in working with
surveillance data, and to have some familiarity with data management and
multivariable analysis in R.
R packages are bundles of functions which extend the capability of R. Thousands of add-on packages are available in the main online repository (known as [CRAN](https://cran.r-project.org/)) and many more packages in development can be found on [GitHub](https://github.com/search?q=R&type=Repositories). They may be installed and updated over the Internet.
We will mainly use packages which come ready installed with R, but where it makes things easier we will use add-on packages. All the R packages you need for the exercises can be installed over the Internet with the following lines of code.
```{r install.packages, eval=FALSE}
required_packages <- c('broom', 'car', 'ggplot2', 'haven',
'ISOweek', 'lubridate', 'MASS', 'pander',
'readxl', 'reshape2', 'TSA', 'zoo')
install.packages(required_packages)
```
Run the following code at the beginning of each of the training days to make sure that all the packages and functions you need are available. Be sure to include it in any scripts too.
```{r library}
# Packages required
required_packages <- c('broom', 'car', 'ggplot2', 'haven',
'ISOweek', 'lubridate', 'MASS', 'pander',
'readxl', 'reshape2', 'TSA', 'zoo')
for(i in seq(along = required_packages))
library(required_packages[i], character.only = TRUE)
# Function to create Stata weekly date
stataweekdate <- function(year, week){
(year - 1960) * 52 + week - 1
}
# Function to create Stata year and week numbers
statawofd <- function(date){
if(!is.Date(date)) stop('date should be a Date.')
dateposix <- as.POSIXlt(date)
dayofyear <- dateposix$yday
week <- floor(dayofyear/7) + 1
week[week %in% 53] <- 52
year <- dateposix$year + 1900
list(year=year, week=week)
}
# Function to tidy glm regression output
glmtidy <- function(x, caption=''){
pander(tidy(x, exponentiate=TRUE, conf.int=TRUE),
caption=caption)
}
# Function to tidy glm regression statistics
glmstats <- function(x){
pander(glance(x))
}
```
R and Stata have minor differences in default settings and methods. In this document we will follow the Stata analysis as closely as possible, but small and usually unimportant differences may be noted between the statistical findings in R and those in Stata. At some points additional steps (which would usually be optional in R) will be taken to produce output comparable to that of Stata. For example, to follow the Stata practice of representing weekly dates in regression models as the number of weeks since the 1st January 1960, we provide the `stataweekdate` function above to calculate this from year and week.
The Stata `wofd` function converts dates to a definition of week which ensures 52 weeks in every year. The first week of the year always starts on 1st January and the last week of the year has 8 days (or 9 days in leap years). This is peculiar to Stata and has no direct equivalent in R or other analytical software, although it is possible to create your own R function to create Stata weeks (see the `statawofd` function above). It is often preferable to use another definition of week, such as [ISO week](https://en.wikipedia.org/wiki/ISO_week_date).
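As a quick sanity check, the arithmetic behind `stataweekdate` can be verified directly. The function is repeated here from the setup chunk so the snippet runs on its own; the values shown are plain arithmetic under the 52-weeks-per-year convention.

```{r}
# stataweekdate() from the setup chunk, repeated so this snippet is
# self-contained: Stata-style weekly date = weeks since week 1 of 1960.
stataweekdate <- function(year, week){
  (year - 1960) * 52 + week - 1
}

stataweekdate(1960, 1)  # 0: the origin of Stata's weekly date scale
stataweekdate(1981, 1)  # (1981 - 1960) * 52 = 1092
```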
R can read in Stata .dta data sets using the `read_dta` command in the `haven` package.
The `glmtidy` function above formats R `glm` regression output more simply and will be used later on. The `glmstats` function tabulates key model-associated statistics.
R comes with a number of in-built data sets. Use `data()` to see what is available. Some of these data sets are time series; for an example, see what `help(Nile)` and `plot(Nile)` give you.
R can hold one or many data sets in memory simultaneously, so there is usually no need to save intermediate files or close and re-open data sets.
# Practical Session 1: Managing and plotting surveillance data
**Expected learning outcomes**
By the end of the session, participants should be able to:
- import surveillance data into R
- manage surveillance data sets with different date formats in R
- plot surveillance data against time
You have been provided with an Excel file (**TSA practice1_nov2016.xls**) with sheets called "dis1" (*Salmonella data*), "dis2" (measles in New York) and "dis3" (acute respiratory infection in Peru); and one Stata file called **puumala.dta** (Puumala virus infections). Make sure that these files are in your working directory, or that you have changed your working directory to where they are.
- If you are not sure where your current working directory is, you can find out using the `getwd()` command.
- To change your working directory, use the `setwd()` command, giving the full path. Paths can be given in two ways in R, with single forward slashes as in `setwd('C:/Users/paul.cleary/Desktop')` (preferred) or with doubled backslashes as in `setwd('C:\\Users\\paul.cleary\\Desktop')` (which will only work on Windows).
Import your data to R from these Excel files:
```{r}
salmo <- read_excel('TSA practice1_nov2016.xls', 'dis1')
summary(salmo)
```
```{r}
measles <- read_excel('TSA practice1_nov2016.xls', 'dis2')
summary(measles)
```
```{r}
ari <- read_excel('TSA practice1_nov2016.xls', 'dis3')
summary(ari)
```
- The `read_excel` command from the `readxl` package reads data from a specific Excel worksheet into memory. [^tibble]
- Run the command `library(help=readxl)` to learn more about the functions in the `readxl` package. How many functions/commands are in the `readxl` package?
- Use `?read_excel` to open the help file on the `read_excel` command.
- The `summary` command, when given a data frame, will provide a quick summary of each variable.
[^tibble]: Strictly speaking the `read_excel` function reads data into a "tibble", which is a data frame modified to work with other packages by the same author. However this makes little practical difference here.
Start with the **salmo** data:
```{r viewsalmo, eval=FALSE}
View(salmo)
```
- The `View` command opens a window to show a specified data set. You can alternatively use the `head` command to view the first rows of data in the console.
There is one variable for the year and one for the week. We can convert these to the number of weeks since 1st January 1960 (for use in regression models, to get similar results to Stata) using the `stataweekdate` function provided above.
```{r}
salmo$date <- with(salmo, stataweekdate(year, week))
head(salmo$date, 10)
```
- The `head` command lists the first values of a variable, or the first rows of a data set. Here it is showing the first 10 values of the `date` variable. There is also a `tail` command to list the last values or rows.
You can see how the number of *Salmonella* cases is distributed in time using the `plot` command after telling R that this is time series data, as shown below.
```{r}
salmoz <- zooreg(salmo$cases, start=c(1981, 1), frequency=52)
plot(salmoz, ylab='Cases', main='Salmonella data')
```
- It is usually better to aggregate your own data from the dates, but if you are given aggregated data the above is the best way to create a time series in R.
- The `zooreg` command from the `zoo` package creates a regular time series; that is, a series of ordered observations at regularly-spaced intervals.
- To create a `zooreg` time series you specify the start date (here the first week of 1981) and that it is weekly data (if monthly data we would use `frequency=12` and if quarterly data you would use `frequency=4`).
- R has an in-built `ts` command for representing time series data which is suitable for many time series analyses, but for ease of data manipulation and handling of missing or duplicated data we recommend routinely using `zoo` time series.
- Note from the plot that there are missing values in the data. The `zoo` package offers several ways of handling these, such as dropping them, filling them with a default value or interpolating missing values from other data. See e.g. the `na.locf` function, which will fill a missing value with the last non-missing value ("last observation carried forward") or `na.approx` which will use linear interpolation (inferring missing values by drawing a straight line between the adjoining non-missing values).
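As a small illustration of these missing-value options, applied to toy data rather than the *Salmonella* series:

```{r}
# Toy zoo series with two missing values.
z <- zooreg(c(4, NA, 6, NA, 10), start = 1, frequency = 1)
na.locf(z)    # last observation carried forward: 4 4 6 6 10
na.approx(z)  # linear interpolation:            4 5 6 8 10
```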
If you prefer the `ggplot2` package, you can use this for `zoo` time series, e.g.:
```{r}
autoplot(salmoz) +
scale_x_continuous(breaks=1980:1990, minor_breaks=1) +
labs(x='Index', y='Cases', title='Salmonella data')
```
As the focus of this training is on understanding principles of time series analysis rather than visualisation of time series, we will mainly use the `plot` command for the remaining exercises.
Continue with the **ari** database:
```{r}
ariz <- zooreg(ari[, 3:5], start=c(2003, 1), frequency=52)
plot(ariz, main='Acute respiratory infection')
```
- Here we select columns 3 to 5 of the **ari** data set for this time series, which therefore contains three variables.
- Note how these are plotted separately. If you would like a single plot, use the command `plot(ariz, plot.type='single', col=1:3)`. (We have added a `col` option to give each of the lines a different colour).
- If you prefer to use `autoplot` instead of `plot`, you can plot multiple time series on a single plot with e.g. `autoplot(ariz, facets=NULL)`.
**NB: Note that the `dis3`/`ari` data contains multiple observations for each week and so requires aggregation before further analysis. The plot produced above is incorrect; you can see that the time series appears to extend into the future. Aggregation will be covered later.**
Now open **puumala.dta**. We can use the `read_dta` command from the R package `haven` to read data stored by Stata.
Here you have one variable for the year, one variable for the month and one variable with the complete date in days, but in a string format.
```{r}
puumala <- read_dta('puumala.dta')
head(puumala)
```
```{r}
summary(puumala)
```
With the complete date you can generate ISO weeks, but first you need the date variable to be converted from text to something R recognises as a date.[^dmy]
```{r}
puumala$date_Date <- as.Date(puumala$date_str, '%d/%m/%Y')
head(puumala$date_Date, 10)
```
[^dmy]: The `lubridate` package has a number of convenient functions for converting text strings to dates. Here you could alternatively use: `puumala$date_Date <- dmy(puumala$date_str)`
- The `'%d/%m/%Y'` indicates to the `as.Date` function that the `date_str` variable contains a 2-digit day of month (01-31), a 2-digit month number (01-12) and a four-digit year, in that order, separated by forward slashes. Read the help for `strftime` to learn about different date formats (`?strftime`).
To convert to ISO week, we can use the `ISOweek` function from the `ISOweek` package,[^why_ISOweek] which creates a new variable representing the ISO week. We could then also create a new variable representing the first Monday of each ISO week.
[^why_ISOweek]: Although the popular `lubridate` package has an `isoweek` function (there is also a similar function in the `surveillance` package), we use the `ISOweek` package here as it has the `ISOweek2date` function. If epidemiological weeks are required, use the `EpiWeek` package.
```{r}
puumala$date_isowk <- ISOweek(puumala$date_Date)
puumala$isodate <- ISOweek2date(paste(puumala$date_isowk,
'-1', sep=''))
head(puumala)
```
- The `paste` command concatenates text.
- Here we are adding "-1" onto the end of the ISO week variable (which is formatted something like "1995-W01"), to indicate that we want the first day of that week, and then supplying that to the `ISOweek2date` function, which converts it to a date.
Have a look at the new variables that have been created. `date_isowk` is the ISO week variable, in string format, and `isodate` is the Monday of each ISO week. Note that the years 1998 and 2004 each have an ISO week 53.
You have several observations in the same week since you have data from different days and one value corresponds to one case. Use the `aggregate` command to aggregate the data.
```{r}
puumala$case <- 1
puumala2 <- aggregate(case ~ isodate, sum, data=puumala)
head(puumala2)
```
- The "formula" `case ~ isodate` indicates to the `aggregate` command that you want to sum cases by `isodate`.
### Optional: measles cases in New York
You have separate columns containing the monthly counts. To plot this data, you need to first reshape your dataset into a single series of values, and then create a time series.
```{r}
measles2 <- as.vector(t(measles[-1]))
measlez <- zooreg(measles2, start=c(1928, 1), frequency=12)
plot(measlez, ylab='Cases', main='Measles in New York')
```
- The first line of code takes the measles data (minus the first column which contains the year), transposes it with the `t` function (i.e. the rows become columns) and then converts that to a single series of values (called a vector in R parlance).
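To see why the transpose is needed, here is a toy example (R stores matrices column by column, so transposing first makes the values come out row by row, i.e. in time order):

```{r}
m <- matrix(1:6, nrow = 2, byrow = TRUE)  # rows: (1,2,3) and (4,5,6)
as.vector(m)     # column-wise: 1 4 2 5 3 6
as.vector(t(m))  # row-wise:    1 2 3 4 5 6
```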
## Mortality surveillance data in Spain: introduction to practical sessions 2 to 6
The Spanish daily mortality monitoring system gathers data from a stable
number of municipalities with computerised records of death; the system
is representative of the population of Spain and was developed in 2004
with the objective of identifying exceedances in mortality during the
summer period.
The Spanish National Centre for Epidemiology is responsible for the
system, which receives data from the Ministry of Justice, the National
Institute for Statistics and the Meteorological Agency.
Sessions 2 to 5 use data from this mortality surveillance system.
Session 6 uses data from the same system, but for only one region in
Spain (Aragon).
# Practical Session 2: Smoothing and trends
**Expected learning outcomes**
By the end of the session, participants are expected to be able to:
- Describe, test and fit a trend in a time series (simple smoothing and regression)
## Preparing your data for time-series analysis
Start a new R script, name it `trends.r` and save it in your working directory.
Write all commands in the R script so that you can run (and re-run) it when needed during the exercise.
Open the **mortality.dta** dataset:
```{r}
mort <- read_dta('mortality.dta')
head(mort)
```
If you want to remember which autonomous community these codes refer to, create a labelled variable.
```{r}
community <- c("Andalucía", "Aragón", "Asturias", "Baleares",
"Canarias", "Cantabria", "Castilla y León",
"Castilla-La Mancha", "Cataluña",
"Comunidad Valenciana", "Extremadura",
"Galicia", "Madrid", "Murcia", "Navarra",
"País Vasco", "La Rioja")
mort$community2 <- factor(mort$community, labels=community)
summary(mort)
```
- The `factor` command converts a variable to a "factor variable", which is a type of categorical variable which allows ordering and labelling of the variable. Factor variables are mainly useful for graphics and regression modelling.[^factor]
[^factor]: Further information on factors can be found [here](http://www.ats.ucla.edu/stat/r/modules/factor_variables.htm).
Note that you have repeated values, since you have data from different autonomous regions and from each sex. In other words, your dataset contains 17 lines per week for males and 17 lines per week for females.
Use the `aggregate` command to get totals for each week, using the years and weeks given:
```{r}
mortagg <- aggregate(cases ~ week + year, sum, data=mort)
head(mortagg)
```
- The `sum` function adds up numbers.
- The `aggregate` command is here calculating the sum of cases by week and year.
Create a `zoo` time series from the total case counts and plot this.
```{r}
mortz <- zooreg(mortagg, start=c(2000, 1), frequency=52)
plot(mortz$cases, ylab='Cases', main='Mortality')
```
- As the `mortz` time series contains more than one variable, we specify which variable we want to plot using dollar notation.
## Moving averages
Moving averages are a simple method to visualise the general trend of a
series after removing some of the random short-term variability
("smoothing the data"). This allows you to examine your data for periodicity and observe the general trend, and can facilitate visual interpretation.
A moving average replaces (models) each value of a series with the arithmetic mean of nearby values. It is calculated for each observation, moving along the time axis. For example, at each time $t$, we may calculate the mean of the 5 previous records.
For each record, create the following various types of moving average using the `rollapply` command from the `zoo` package.
- `MA5a`: the 5-week moving average, centred on the current week
- `MA5b`: the 5-week moving average of cases and the 4 previous weeks
- `MA5c`: the 5-week moving average of the 5 previous weeks
Compare results:
```{r}
MA5a <- rollapply(mortz$cases, width=5, FUN=mean,
align='center')
MA5b <- rollapply(mortz$cases, width=5, FUN=mean,
align='right')
MA5c <- rollapply(mortz$cases, width=6,
FUN=function(x) mean(x[-6]),
align='right')
mortzma <- merge(cases=mortz$cases, MA5a, MA5b, MA5c)
head(mortzma)
```
```{r ma, fig.height=8}
plot(mortzma, plot.type='single', lty=1:4,
ylab='Cases', main='Moving averages')
grid()
legend('topright', c('Cases', 'MA5a', 'MA5b', 'MA5c'), lty=1:4,
title='Legend')
```
- The `rollapply` function applies functions to moving (or "rolling") windows of the data in the time series.
- The `width` argument gives the width of the moving window.
- The `align` argument indicates where the time point of interest is in the moving window. Most commonly we are interested in the most recent value as a function of preceding values, for which we specify `'right'`.
- We apply a more complicated function to get `MA5c`, taking a moving window of six values but omitting the latest value from the calculation of the average.
- We then use the `merge` command to combine the individual time series.
- The `grid` command adds gridlines at axis tick marks.
- The `legend` command adds a legend.
We observe that the calculation is similar across these various methods, but is
not aligned to the series in the same way. `MA5a` is centred
in the middle of the period used to calculate the mean. `MA5b` is placed
at the end of the period. `MA5c` is placed one step forward (smoothing
commands can be used for forecasting the following point). The "models"
provided are similar for a 5-week window, but the lag is different.
Moving average is only one way of smoothing. Other ways of smoothing
the data to get a general idea of the trend include, for
example, loess smoothing, where the contribution of surrounding
observations is weighted, i.e. it is not the arithmetical mean for each
set (window) of observations.
```{r}
scatter.smooth(mortz$cases, ylab='Mortality',
main='Loess smoothing')
```
- The `scatter.smooth` command gives a scatter plot of the time series, with a superimposed smooth curve obtained using LOESS. You can change the degree of smoothing by adding a `span` option.
- What does `scatter.smooth(mortz$cases, ylab='Mortality', main='Loess smoothing', span=0.1)` look like?
To better observe the general trend, we need to find the length of the
moving average that will erase the seasonal component. Various lengths
can be tried; here we have used 25, 51 and 103.
```{r}
MA25 <- rollapply(mortz$cases, width=25, FUN=mean,
align='right')
MA51 <- rollapply(mortz$cases, width=51, FUN=mean,
align='right')
MA103 <- rollapply(mortz$cases, width=103, FUN=mean,
align='right')
mortzma2 <- merge(cases=mortz$cases, MA25, MA51, MA103)
```
- You can use `View` to examine `mortzma2`.
```{r ma2, fig.height=8}
plot(mortzma2, plot.type='single', lty=1:4,
ylab='Cases', main='Moving averages')
grid()
legend('topright', c('Cases', 'MA25', 'MA51', 'MA103'), lty=1:4,
title='Legend')
```
Comment on the lines provided by the different smoothing windows.
Which one do you think is the best for eliminating seasonality? What would happen if you used an even greater window?
## Regression: linear trend
Regressing against time is a very simple way to look
at the trend and to test the slope with the Wald test provided in the output.
We will use the standard `lm` function to fit a linear regression model.
```{r}
mortz$Date <- with(mortagg, stataweekdate(year, week))
linregmodel <- lm(cases ~ Date, data=mortz)
names(linregmodel)
summary(linregmodel)
```
<!-- . regress cases date -->
<!-- Source | SS df MS Number of obs = 520 -->
<!-- -------------+------------------------------ F( 1, 518) = 18.39 -->
<!-- Model | 5491460.89 1 5491460.89 Prob > F = 0.0000 -->
<!-- Residual | 154673358 518 298597.217 R-squared = 0.0343 -->
<!-- -------------+------------------------------ Adj R-squared = 0.0324 -->
<!-- Total | 160164819 519 308602.735 Root MSE = 546.44 -->
<!-- ------------------------------------------------------------------------------ -->
<!-- cases | Coef. Std. Err. t P>|t| [95% Conf. Interval] -->
<!-- -------------+---------------------------------------------------------------- -->
<!-- date | .6845897 .1596354 4.29 0.000 .3709772 .9982022 -->
<!-- _cons | 3054.975 374.2351 8.16 0.000 2319.77 3790.181 -->
<!-- ------------------------------------------------------------------------------ -->
- To obtain regression results comparable to those of Stata, we calculate the number of weeks since 1st January 1960 (the `Date` variable).[^dig_at_stata]
- The `lm` command fits a linear regression model. You can use the `names` command to see the various components of the output of the `lm` command, as above.[^str] As with data frames, you use dollar notation to access individual components.
- Variables in R models are specified using notation along the lines of `responsevariable ~ explanatoryvariable1 + explanatoryvariable2` etc.[^wr]
- `summary`, when given the results of fitting a linear regression model, will provide a summary of the fitted trend and associated statistics.
[^dig_at_stata]: Do you think that change in mortality *per week* is really of interest? What else could you do?
[^str]: The `str` command is also often useful for looking at the "structure" of the output of an R function.
[^wr]: This is called Wilkinson-Rogers notation; see <http://www.physiol.ox.ac.uk/~raac/R.shtml> for more details. `~` is called a "tilde".
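If the course's `stataweekdate` helper is not available, a minimal stand-in can be written by hand. This sketch assumes Stata's `%tw` convention of exactly 52 weeks per year, with week 1 of 1960 coded as 0:
```{r}
# Hypothetical stand-in for the course's stataweekdate() helper
# (assumes Stata's %tw convention: 52 weeks per year, 1960 week 1 = 0)
stataweekdate <- function(year, week) {
  (year - 1960) * 52 + week - 1
}
stataweekdate(1960, 1)   # 0
stataweekdate(1961, 1)   # 52
```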
Identify and interpret the intercept and the trend.
Plot the fitted values against the observed ones.
```{r}
plot(mortz$cases, ylab='Mortality',
main='Mortality with fitted trend')
lines(as.vector(time(mortz)), fitted(linregmodel),
      col='green', lwd=2)
grid()
legend('topright', c('Data', 'Model'), col=c('black', 'green'), lwd=2)
```
- The `lines` command adds lines to an existing plot and requires $x$ and $y$ coordinates. `lwd=2` means double the line width.
- The `as.vector` function converts the dates of the time series into something the `lines` function can use as $x$ coordinates.
Could you have used a regression technique other than linear regression?
What are the advantages and disadvantages of using linear regression
when modelling numbers of cases of a disease?
Since we are dealing with counts, you could use Poisson regression (introduced later) or, to account for possible overdispersion, negative binomial regression, via the `glm.nb` function from the `MASS` package. To present the results we use the `glmtidy` function, introduced later.
```{r}
nbmodel <- glm.nb(cases ~ Date, data=mortz)
glmtidy(nbmodel)
```
<!-- . nbreg cases date, irr -->
<!-- Fitting Poisson model: -->
<!-- Iteration 0: log likelihood = -18538.24 -->
<!-- Iteration 1: log likelihood = -18538.24 -->
<!-- Fitting constant-only model: -->
<!-- Iteration 0: log likelihood = -4911.9941 -->
<!-- Iteration 1: log likelihood = -4029.0268 -->
<!-- Iteration 2: log likelihood = -4003.1058 -->
<!-- Iteration 3: log likelihood = -4000.1564 -->
<!-- Iteration 4: log likelihood = -4000.1492 -->
<!-- Iteration 5: log likelihood = -4000.1492 -->
<!-- Fitting full model: -->
<!-- Iteration 0: log likelihood = -3990.4186 -->
<!-- Iteration 1: log likelihood = -3990.2314 -->
<!-- Iteration 2: log likelihood = -3990.2313 -->
<!-- Negative binomial regression Number of obs = 520 -->
<!-- LR chi2(1) = 19.84 -->
<!-- Dispersion = mean Prob > chi2 = 0.0000 -->
<!-- Log likelihood = -3990.2313 Pseudo R2 = 0.0025 -->
<!-- ------------------------------------------------------------------------------ -->
<!-- cases | IRR Std. Err. z P>|z| [95% Conf. Interval] -->
<!-- -------------+---------------------------------------------------------------- -->
<!-- date | 1.000148 .0000329 4.50 0.000 1.000083 1.000212 -->
<!-- _cons | 3293.31 254.0455 105.00 0.000 2831.203 3830.841 -->
<!-- -------------+---------------------------------------------------------------- -->
<!-- /lnalpha | -4.390631 .0629277 -4.513967 -4.267295 -->
<!-- -------------+---------------------------------------------------------------- -->
<!-- alpha | .0123929 .0007799 .0109549 .0140197 -->
<!-- ------------------------------------------------------------------------------ -->
Again we can plot the fitted values against the data.
```{r}
plot(mortz$cases, ylab='Mortality',
main='Mortality with fitted trend')
lines(as.vector(time(mortz)), fitted(nbmodel),
      col='green', lwd=2)
grid()
legend('topright', c('Data', 'Model'), col=c('black', 'green'), lwd=2)
```
## Help for optional Task 2.3.1
To take the population into account, you can model the counts with Poisson regression (introduced later), using the population as an offset; first, you need to import the population data provided into your dataset. We will assume the population was steady each year and "jumped" to a new value on January 1st.
```{r}
mortz$pop <-
rep(c(39953520, 40688520, 41423520, 42196231, 42859172,
43662613, 44360521, 45236004, 45983169, 46367550),
each=52)
```
- We have used the `rep` function to repeat each annual population value 52 times, once for each week of that year.
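The `each` argument repeats every element in turn before moving on to the next; a tiny example:
```{r}
rep(c(10, 20), each=3)   # 10 10 10 20 20 20
```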
Now fit a Poisson model with the logarithm of the population as an offset; note that the yearly "jump" in population affects the predicted values, too:
```{r}
mortpoismodel <- glm(cases ~ offset(log(pop)) + Date, data=mortz,
family='poisson')
glmtidy(mortpoismodel)
```
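The `offset(log(pop))` term enters the linear predictor with its coefficient fixed at 1, so the model describes the mortality *rate* (cases per person) rather than the raw count. A small sketch on simulated data (made up purely for illustration):
```{r}
# Simulated sketch: with a log-population offset, Poisson regression
# recovers the trend in the rate rather than in the count
set.seed(1)
pop <- rep(c(1e5, 2e5), each=25)              # population doubles half-way
x <- 1:50
cases <- rpois(50, pop * exp(-9 + 0.01 * x))  # true rate exp(-9 + 0.01 x)
coef(glm(cases ~ offset(log(pop)) + x, family='poisson'))
# the estimated coefficient of x should be close to the true 0.01
```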
<!-- . poisson cases date, irr exp(population) -->
<!-- Iteration 0: log likelihood = -17898.301 -->
<!-- Iteration 1: log likelihood = -17898.301 (backed up) -->
<!-- Poisson regression Number of obs = 520 -->
<!-- LR chi2(1) = 1694.74 -->
<!-- Prob > chi2 = 0.0000 -->
<!-- Log likelihood = -17898.301 Pseudo R2 = 0.0452 -->
<!-- ------------------------------------------------------------------------------ -->
<!-- cases | IRR Std. Err. z P>|z| [95% Conf. Interval] -->
<!-- -------------+---------------------------------------------------------------- -->
<!-- date | .9998236 4.28e-06 -41.17 0.000 .9998152 .999832 -->
<!-- _cons | .0001627 1.64e-06 -867.17 0.000 .0001596 .000166 -->
<!-- ln(popula~n) | 1 (exposure) -->
<!-- ------------------------------------------------------------------------------ -->
```{r}
plot(mortz$cases, ylab='Mortality',
main='Mortality with fitted trend')
lines(as.vector(time(mortz)), fitted(mortpoismodel),
      col='green', lwd=2)
grid()
legend('topright', c('Data', 'Model'), col=c('black', 'green'), lwd=2)
```
Because of the yearly jump in population, the fitted trend, which is now a trend in the mortality *rate*, is negative. The predicted number of deaths still rises overall, because in absolute terms the yearly jump outweighs the yearly effect of the negative trend.
## Help for optional Task 2.3.2
In the `salmo` example, an exponential trend might fit the data better.
Generate a new variable `lcases` as the natural logarithm of cases:
```{r}
salmo$lcases <- log(salmo$cases)
salmoz2 <- zooreg(salmo, start=c(1981, 1), frequency=52)
```
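`zooreg` creates a regularly spaced series in which `start` and `frequency` determine the time index; a minimal sketch:
```{r}
library(zoo)
z <- zooreg(1:4, start=c(1981, 1), frequency=52)
time(z)   # 1981 + 0/52, 1981 + 1/52, 1981 + 2/52, 1981 + 3/52
```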
Plot the logarithm of the number of cases according to the time:
```{r}
plot(salmoz2$lcases, ylab='Cases (natural log scale)',
main='Log Salmonella data')
```
Fit a model of `lcases` against date using linear regression.
```{r}
logmodel <- lm(lcases ~ date, data=salmoz2)
summary(logmodel)
```
<!-- . regress lcases date -->
<!-- Source | SS df MS Number of obs = 419 -->
<!-- -------------+------------------------------ F( 1, 417) = 796.23 -->
<!-- Model | 630.682171 1 630.682171 Prob > F = 0.0000 -->
<!-- Residual | 330.297732 417 .792080891 R-squared = 0.6563 -->
<!-- -------------+------------------------------ Adj R-squared = 0.6555 -->
<!-- Total | 960.979902 418 2.29899498 Root MSE = .88999 -->
<!-- ------------------------------------------------------------------------------ -->
<!-- lcases | Coef. Std. Err. t P>|t| [95% Conf. Interval] -->
<!-- -------------+---------------------------------------------------------------- -->
<!-- date | .0099544 .0003528 28.22 0.000 .009261 .0106479 -->
<!-- _cons | -9.83548 .4646589 -21.17 0.000 -10.74885 -8.922114 -->
<!-- ------------------------------------------------------------------------------ -->
Plot the log data and the model against time. Note that fitted values cannot be created where the number of cases is missing. Stata seems to do linear interpolation of the fitted values to fill in the gaps, so we will do that too.
```{r}
salmoz2$ltrend[!is.na(salmoz2$lcases)] <- fitted(logmodel)
salmoz2$ltrend <- na.approx(salmoz2$ltrend)
plot(salmoz2$lcases, ylab='Log Salmonella cases',
main='Log Salmonella data with fitted trend')
lines(salmoz2$ltrend, col='green', lwd=2)
grid()
legend('bottomright', c('Data', 'Model'), col=c('black', 'green'), lwd=2)
```
- We use `is.na` as above to correctly slot the fitted values back into the time series.
- We then use `na.approx`, mentioned above, to interpolate missing values.
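`na.approx` fills in `NA` values by linear interpolation between the nearest non-missing neighbours; for example:
```{r}
library(zoo)
na.approx(c(1, NA, 3, NA, NA, 9))   # 1 2 3 5 7 9
```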
Generate a new variable (`trend`), the antilog of the prediction (`ltrend`):
```{r}
salmoz2$trend <- exp(salmoz2$ltrend)
```
Plot the real data (`cases`) and this model (`trend`) according to time:
```{r}
plot(salmoz2$cases, ylab='Salmonella cases',
main='Salmonella data with fitted trend')
lines(salmoz2$trend, col='green', lwd=2)
grid()
legend('topleft', c('Data', 'Model'), col=c('black', 'green'), lwd=2)
```
Compare the results with those you would have got if you had used linear regression on the original data.
Alternatively, you can run a Poisson or a negative binomial regression to model the data:
```{r}
salmopoismodel <- glm(cases ~ date, data=salmoz2, family='poisson')
summary(salmopoismodel)
```
<!-- . poisson cases date -->
<!-- Iteration 0: log likelihood = -7352.4691 -->
<!-- Iteration 1: log likelihood = -7342.0468 -->
<!-- Iteration 2: log likelihood = -7342.0388 -->
<!-- Iteration 3: log likelihood = -7342.0388 -->
<!-- Poisson regression Number of obs = 419 -->
<!-- LR chi2(1) = 28234.54 -->
<!-- Prob > chi2 = 0.0000 -->
<!-- Log likelihood = -7342.0388 Pseudo R2 = 0.6579 -->
<!-- ------------------------------------------------------------------------------ -->
<!-- cases | Coef. Std. Err. z P>|z| [95% Conf. Interval] -->
<!-- -------------+---------------------------------------------------------------- -->
<!-- date | .0100045 .0000712 140.55 0.000 .009865 .010144 -->
<!-- _cons | -9.60511 .1019175 -94.24 0.000 -9.804865 -9.405355 -->
<!-- ------------------------------------------------------------------------------ -->
- There is more on using Poisson regression later on.
```{r}
salmoz2$trend2[!is.na(salmoz2$cases)] <- fitted(salmopoismodel)
salmoz2$trend2 <- na.approx(salmoz2$trend2)
plot(salmoz2$cases, ylab='Salmonella cases',
main='Salmonella data with fitted trend')
lines(salmoz2$trend2, col='green', lwd=2)
grid()
legend('topleft', c('Data', 'Model'), col=c('black', 'green'), lwd=2)
```
# Practical Session 3: Periodicity
**Expected learning outcomes**
By the end of this session, participants should be able to:
- assess the existence of periodicity in surveillance data
- fit and interpret models containing a trend and one or several sine and cosine curves on surveillance data to model both trend and periodicity
You have already identified the existence of a trend in this
surveillance data. You are now interested in detecting any periodicity.
Are there any cyclical patterns in the data?
## Generate a periodogram
R has a number of functions that produce a periodogram; we will use the `periodogram` function from the `TSA` package here.
```{r periodogram, fig.width=6}
mortper <- periodogram(mortz$cases)
```
- The `periodogram` command/function plots the estimated periodogram for the given time series and also returns various useful outputs as a list which we can use for other analyses.
- This type of plot is interpreted by identifying peaks. Convert the frequencies at which peaks occur to periods by taking the reciprocal.
```{r}
with(mortper,
1 / head(freq[order(-spec)], 3))
```
- The above line of code finds the frequencies with the three highest peaks in the periodogram and converts those frequencies to periods (which are numbers of weeks for this data).
- The `order` function is used here to sort the frequencies by decreasing spectral density, i.e. largest peak first (don't miss out the negative sign, which reverses the default increasing sort).
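The same idiom on made-up numbers (hypothetical `freq` and `spec` values, just to show the mechanics):
```{r}
spec <- c(5, 40, 12)          # toy spectral densities
freq <- c(0.10, 0.02, 0.25)   # the corresponding frequencies
order(-spec)                  # 2 3 1: positions sorted by decreasing density
1 / head(freq[order(-spec)], 2)   # periods of the two highest peaks: 50 10
```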
To plot the periodogram more like the Stata `epergram` command does:
```{r periodgram2, fig.width=8}
with(mortper,
plot(1/freq, log(spec), type='l',
xlim=c(0, 160),
xlab='Period', ylab='Log(density)'))
```
- Note the `xlim` option which defines the range of the $x$ axis.
Is there any periodicity? If so, what is the period?
## Fitting a sine curve
As the periodogram shows periodicity close to 52 weeks, we will use a
sine curve of a 52-week period. Note that periodicity with a period of
one year is also referred to as seasonality.
Fit a linear regression of `cases` with a sine predictor term:
```{r}
mortz$sin52 <- sin(2 * pi * mortz$Date / 52)
sinemodel <- lm(cases ~ sin52, data=mortz)
summary(sinemodel)
```
<!-- . regress cases sin52 -->
<!-- Source | SS df MS Number of obs = 520 -->
<!-- -------------+------------------------------ F( 1, 518) = 63.74 -->
<!-- Model | 17547985.8 1 17547985.8 Prob > F = 0.0000 -->
<!-- Residual | 142616833 518 275322.072 R-squared = 0.1096 -->
<!-- -------------+------------------------------ Adj R-squared = 0.1078 -->
<!-- Total | 160164819 519 308602.735 Root MSE = 524.71 -->
<!-- ------------------------------------------------------------------------------ -->
<!-- cases | Coef. Std. Err. t P>|t| [95% Conf. Interval] -->
<!-- -------------+---------------------------------------------------------------- -->
<!-- sin52 | 259.7927 32.54122 7.98 0.000 195.8637 323.7217 -->
<!-- _cons | 4656.573 23.01012 202.37 0.000 4611.368 4701.778 -->
<!-- ------------------------------------------------------------------------------ -->
Plot the fitted values against the original data and comment.
```{r}
plot(mortz$cases, ylab='Mortality',
main='Regression model: one sine term')
lines(as.vector(time(mortz)), fitted(sinemodel),
      col='green', lwd=2)
legend('topright', c('Data', 'Model'), col=c('black', 'green'), lwd=2)
```
## Addition of the cosine
To give your model an appropriate phase (you don't have to worry about
identifying it; it is estimated automatically), you need to include
*both* a sine and a cosine curve with the same period. The sum of these
two curves is a single periodic curve of the specified period with the
phase that best describes our data.
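The reason one sine plus one cosine can capture any phase is the identity $a\sin(\omega t) + b\cos(\omega t) = A\sin(\omega t + \varphi)$, where $A = \sqrt{a^2 + b^2}$ is the amplitude and $\varphi = \operatorname{atan2}(b, a)$ the phase. A quick numerical check with arbitrary coefficients:
```{r}
# Numerical check of the sine-plus-cosine identity for arbitrary a, b
a <- 2; b <- 3
t <- 0:51
w <- 2 * pi / 52
A <- sqrt(a^2 + b^2)
phi <- atan2(b, a)
max(abs(a * sin(w * t) + b * cos(w * t) - A * sin(w * t + phi)))   # ~0
```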
You can fit a linear model for the cases, using sine and cosine as explanatory variables. Plot the fitted values against the data and comment:
```{r}
mortz$cos52 <- cos(2 * pi * mortz$Date / 52)
sinecosmodel <- lm(cases ~ sin52 + cos52, data=mortz)
summary(sinecosmodel)
```
<!-- . regress cases sin52 cos52 -->
<!-- Source | SS df MS Number of obs = 520 -->
<!-- -------------+------------------------------ F( 2, 517) = 325.63 -->
<!-- Model | 89285169.9 2 44642584.9 Prob > F = 0.0000 -->
<!-- Residual | 70879649.3 517 137097.968 R-squared = 0.5575 -->
<!-- -------------+------------------------------ Adj R-squared = 0.5557 -->
<!-- Total | 160164819 519 308602.735 Root MSE = 370.27 -->
<!-- ------------------------------------------------------------------------------ -->
<!-- cases | Coef. Std. Err. t P>|t| [95% Conf. Interval] -->
<!-- -------------+---------------------------------------------------------------- -->
<!-- sin52 | 259.7927 22.96301 11.31 0.000 214.6804 304.905 -->
<!-- cos52 | 525.2735 22.96301 22.87 0.000 480.1612 570.3858 -->
<!-- _cons | 4656.573 16.2373 286.78 0.000 4624.674 4688.472 -->
<!-- ------------------------------------------------------------------------------ -->
```{r}
plot(mortz$cases, ylab='Mortality',
main='Regression model: sine, cosine terms')
lines(as.vector(time(mortz)), fitted(sinecosmodel),
      col='green', lwd=2)
legend('topright', c('Data', 'Model'), col=c('black', 'green'), lwd=2)
```
How does the fit look graphically?
Now adjust not only for periodicity, but also for trend:
```{r}
sinecostrendmodel <- lm(cases ~ sin52 + cos52 + Date,
data=mortz)
summary(sinecostrendmodel)
```
<!-- . regress cases sin52 cos52 date -->
<!-- Source | SS df MS Number of obs = 520 -->
<!-- -------------+------------------------------ F( 3, 516) = 261.88 -->
<!-- Model | 96671544 3 32223848 Prob > F = 0.0000 -->
<!-- Residual | 63493275.2 516 123048.983 R-squared = 0.6036 -->
<!-- -------------+------------------------------ Adj R-squared = 0.6013 -->
<!-- Total | 160164819 519 308602.735 Root MSE = 350.78 -->
<!-- ------------------------------------------------------------------------------ -->
<!-- cases | Coef. Std. Err. t P>|t| [95% Conf. Interval] -->
<!-- -------------+---------------------------------------------------------------- -->
<!-- sin52 | 272.9587 21.82093 12.51 0.000 230.0899 315.8275 -->
<!-- cos52 | 526.0699 21.7549 24.18 0.000 483.3308 568.809 -->
<!-- date | .7963937 .1027901 7.75 0.000 .5944552 .9983322 -->
<!-- _cons | 2793.41 240.9689 11.59 0.000 2320.009 3266.811 -->
<!-- ------------------------------------------------------------------------------ -->
```{r}
plot(mortz$cases, ylab='Mortality',
main='Regression model: trend, sine, cosine terms')
lines(as.vector(time(mortz)), fitted(sinecostrendmodel),
      col='green', lwd=2)
legend('topright', c('Data', 'Model'), col=c('black', 'green'), lwd=2)
```
How does the fit look now graphically?