This repository has been archived by the owner on Apr 11, 2024. It is now read-only.
-
-
Notifications
You must be signed in to change notification settings - Fork 83
/
Copy pathtwelve_cities.Rmd
1118 lines (1022 loc) · 64.1 KB
/
twelve_cities.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Twelve Cities: Does lowering speed limits save pedestrian lives?"
author: "Jonathan Auerbach and Rob Trangucci"
date: "1/14/2017"
output:
pdf_document:
bibliography: twelve_cities.bib
---
```{r setup, include=FALSE}
library("knitr")
library("rstan")
library("reshape2")
library("tikzDevice")
library("GGally")
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores(),
tikzDefaultEngine = 'xetex')
theme_set(theme_light(base_size = 8))
crash_data <- read.csv("crash_data.csv")
```
##Summary of Analysis
We investigate whether American cities can expect to achieve a meaningful reduction in pedestrian deaths by lowering the posted speed limit. We present our work in three sections. First we briefly motivate the problem and provide a description of the dataset. Second we fit a log-linear model and compare sources of variation with an analysis of variance. Finally we demonstrate a sample use case. We evaluate the decision to lower many of New York City's posted speed limits from 30 mph to 25 mph. In our evaluation we assume the assignment of speed limits to roads is ignorable given measured covariates, and we calculate the number of lives saved by estimating the causal effect of lowering the speed limit on New York City roads from 30 mph to 25 mph on 25 mph roads. We find some evidence that a lower speed limit does in fact reduce fatality rates, and our estimated causal effect is similar to the traditional before-after analysis espoused by policy analysts. Nevertheless, we conclude that adjusting the posted speed limit in urban environments does not correspond with a reliable reduction in pedestrian fatalities.
We would like to thank both reviewers for their insightful comments and helpful suggestions.
##I. Introduction
###The Policy Question
Over the last few years, American cities have started to establish aggressive countermeasures that force road users to make safer decisions. These policies are collectively known as Vision Zero and have a stated goal of creating a road system with zero traffic fatalities.
In theory, Vision Zero is a comprehensive road investment strategy that prioritizes safety over vehicle mobility by anticipating human error and then slowing vehicles to the safest possible travel speed. Citywide road redesign and increased levels of enforcement and outreach are integral to a Vision Zero strategy [@tingvall2000vision]. These approaches are effective because they confine drivers, considerably reducing the chance that human error will result in a fatality. Yet such changes are costly to implement and challenging to sustain across an entire city. In practice, policymakers simply mandate that vehicles reduce their travel speed by lowering the posted speed limit below the speed for which the road was originally designed.
Lowering the speed limit is politically expedient since speed limit changes can be implemented immediately and at relatively little cost, and they can be sustained across an entire city at no additional expense. But lowering a speed limit without improvements to road design, enforcement and outreach may do little to reduce fatalities if drivers feel little pressure to comply with the lower limit [@leaf1999literature]. In fact, the National Highway Safety Traffic Administration (NHSTA) rates the countermeasure "reduce and enforce speed limits" three out of five stars for improving safety because research indicates that actual speed is reduced by only a fraction of the change in the posted speed limit [@goodwin2010countermeasures].
Twelve major American cities have officially set a Vision Zero strategy in response to vocal advocates: Chicago, San Francisco, New York City, Boston, Los Angeles, Austin, Portland, Seattle, San Jose, San Diego, Washington D.C. and Denver. The majority of these strategies include adjustments to many or all of the posted speed limits. For example, New York City lowered the default citywide speed limit from 30 to 25 mph in 2015. San Francisco uniformly set the speed limit around schools to 15 mph and is considering a default citywide speed limit of 20 mph.
The policy question is twofold. The first is descriptive: which aspects of roads drive pedestrian fatalities? The second is prescriptive: can a speed limit adjustment meaningfully reduce the number of pedestrian deaths? We analyze data from the twelve Vision Zero cities to investigate these questions. To our knowledge this is the first study to systematically examine Vision Zero data across multiple cities.
###The Data
Our dataset comprises every pedestrian death in the twelve major American cities with Vision Zero policies between 2010 and 2015 for which there was no missing covariate information. The unit of our analysis, however, is the immediate region or road segment on which each death occurred. We consequently view our dataset as a random sample of regions, and we obtain sampling weights equal to the reciprocal of the probability each region would be included in the sample.
The dataset has the following eight qualitative variables describing each road region: the weather and surface condition (COND), the city (CITY), the year (YEAR), the posted speed limit (SLIM), the presence of various signs or signals (SIGN), the time and lighting (LGHT), the physical road characteristics or built environment (BLTE) and the annual average traffic density (TFFC). These variables can be thought of as batches of categories that describe similar aspects of road regions. The dataset has two quantitative variables: the number of pedestrians exposed to the road region (EXPR) and the estimated number of road regions represented by the observation (WGHT). More information on the dataset can be found in the Appendix.
The first six observations of the dataset are displayed in Table 1 below. Table 2 shows the number of observations within the most common qualitative variable levels. Some of these variables are large, with COND, LGHT and BLTE having 25, 36 and 111 categories each. Throughout this analysis we present the qualitative variable levels in alphabetical order followed by the quantitative variables.
```{r summary, warning = FALSE, message = FALSE}
kable(round(head(crash_data[c(order(colnames(crash_data)[1:8]),9:10)]),2),
caption="Example Observations of Dataset")
kable(summary(apply(crash_data[,order(colnames(crash_data)[1:8])],2,factor)),
caption="Summary of Dataset")
```
Figure 1.1 plots a histogram of the observations by city. Figure 1.2 plots these cities by the most common speed limits, and Figure 1.3 plots them by both the most common speed limit and the last four years of the dataset. Cities are arranged from smallest population at the top to largest population at the bottom (as of the 2010 Census).
```{r explore, warning = FALSE, message = FALSE}
city_labels <- factor(crash_data$CITY,labels=c(
"Washington D.C.","Boston","Austin","Denver","Portland","Chicago",
"Seattle","Los Angeles","San Diego","San Francisco","San Jose",
"New York City"))
city_labels <- factor(city_labels,
levels=levels(city_labels)[c(12,8,6,9,11,3,10,7,4,1,2,5)])
crash_viz <- crash_data; crash_viz$city_labels <- city_labels
ggplot(crash_viz) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
aes(city_labels) +
geom_histogram(stat="count") +
coord_flip() +
scale_y_continuous(expand = c(0, 0)) +
labs(x="",y="",
title="Figure 1.1: Number of Observations per City",
caption = paste("This figure exhibits the number of deaths in each Vision Zero city",
"over the 2010 to 2015 sample period. Cities are arranged in increasing",
"order \nof population. The number of fatalities corresponds loosely",
"with population size."))
ggplot(crash_viz[crash_viz$SLIM %in% 5:8,]) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
aes(city_labels) +
geom_histogram(stat="count") +
coord_flip() +
labs(x="",y="") +
facet_wrap(~factor(SLIM,labels = c("20 MPH","25 MPH","30 MPH","35 MPH"))) +
scale_y_continuous(expand = c(0, 0)) +
labs(x="",y="",
title="Figure 1.2: Number of Observations per City by Posted Speed Limit",
caption = paste("This figure exhibits the number of deaths in each Vision Zero city",
"over the 2010 to 2015 sample period stratified by whether the posted",
"speed \nlimit on the road was 20, 25, 30 or 35 MPH. Cities are",
"arranged in increasing order of population. Speed limits of 25, 30 and",
"35 are observed \nfor every major city."))
ggplot(crash_viz[crash_viz$SLIM %in% 6:8 &
crash_viz$YEAR %in% 3:6,]) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
aes(city_labels) +
geom_histogram(stat="count") +
coord_flip() +
labs(x="",y="") +
facet_grid(factor(YEAR,labels = c("2012","2013","2014","2015"))~
factor(SLIM,labels = c("25 MPH","30 MPH","35 MPH"))) +
scale_y_continuous(expand = c(0, 0)) +
labs(x="",y="",
title="Figure 1.3: Number of Observations per City by Posted Speed Limit and Year",
caption = paste("This figure exhibits the number of deaths in each Vision Zero city",
"over the 2012 to 2015 sample period stratified by the year each",
"fatality took \nplace and whether the posted speed limit on the road",
"was 25, 30 or 35 MPH. Cities are arranged in increasing order of",
"population. Speed \nlimits of 25, 30 and 35 are not observed for every",
"major city in every year."))
```
The 2015 reduction in New York City's default speed limit from 30 to 25 mph is apparent in Figure 1.3. In this analysis, we construct a model for the years 2010-2014 and estimate the causal effect of this 2015 policy. Reserving the observations made during the year 2015 for testing our model serves two purposes. First, an independent testing set establishes stronger evidence that the model is predictive of the potential number of fatalities that would occur as a result of either speed limit policy. Second, the training set reflects available information before the policy was adopted, and consequently an analysis similar to ours could have been performed before the policy was implemented. Coincidentally, our results are comparable to a naïve version of the "before-after" analysis favored by policymakers.
##II. Model and Inference
Policymakers consider a wide variety of road characteristics when setting the posted speed limit. As a result, the relationship between pedestrian deaths and the speed limit is potentially confounded by any of the large number of qualitative variable categories in our dataset. We summarize the pairwise association of the eight qualitative variables using Cramer's V statistic in Figure 2. We find that no two covariates are completely associated although nearly every covariate exhibits moderate association. This suggests we should account for all eight qualitative variables in our initial model, and possibly some interactions.
```{r colinearity, warning = FALSE, message = FALSE, fig.align="center"}
cramer.v <- function(X1,X2)
sqrt(chisq.test(X1, X2)$statistic / ((length(X1) *
(min(length(unique(X1)),length(unique(X2))) - 1))))
cor_mat <- matrix(ncol = 8, nrow = 8,
dimnames = list(colnames(crash_data)[1:8],
colnames(crash_data)[1:8]))
for(i in 1:8) for(j in 1:8) cor_mat[i,j] <- cramer.v(crash_data[,i],crash_data[,j])
ggcorr(data=NULL,cor_matrix = cor_mat,
low = "steelblue", mid = "white", high = "darkred",
label = TRUE, label_size = 3, size = 3) + theme_void() +
theme(
plot.title = element_text(size=9, hjust = 0),
plot.caption = element_text(size = 7, hjust = 1),
legend.position = "none") +
labs(
title="Figure 2: Correlation of Qualitative Variables using Cramer's V",
caption = paste("This figure exhibits the marginal correlation",
"of the eight qualitative variables using Cramer's V statistic.",
"\nA value of 0 indicates that the two variables have little",
"association, and a value of 1 indicates that the \ntwo variables",
"are equal to each other. From this figure we observe that all of the",
"variables are \nmoderately associated. The largest association exists",
"between the posted speed limit, \ntraffic density and the built",
"environment."))
```
With all eight qualitative variables, the data constitute a high dimensional contingency table where each cell is a low, non-zero count. It is conventional to employ a log-linear model for such data. Causal estimates can then be calculated by imputing the values of missing cells representing unobserved counterfactual outcomes and then comparing those outcomes that address the question of interest. However, the validity of these estimates depends on a variety of assumptions, and we briefly highlight three major assumptions. See @gelman2006data or @imbens2015causal for a general discussion.
###Assumptions
We assume the assignment of posted speed limits to road regions is strongly ignorable given the observed covariates. This implies that road regions with similar covariate values (in a particular city, near a school, up a hill, etc.) are equally likely to be designated a certain speed limit. In particular, among two such road regions, the one with fewer pedestrian deaths cannot be more or less likely to post any speed limit.
This assumption would be violated if speed limit policy were based on an omitted variable that also influenced the fatality rate. For example, under Oregon state law, Portland can reduce speed limits to 20 mph provided additional signage indicating the presence of pedestrians is displayed [@essex2016transportation]. Failing to consider the presence of signs and signals would distort the estimated relationship of interest between the posted speed limit and pedestrian fatalities by inadvertently incorporating the relationship between signs and fatalities, which is not of explicit interest. Consultation with experts of speed limit policy informed our choice of covariates, and we believe we have accounted for all important information used by policymakers to set speed limits.
We assume no interference between the units of analysis, the road regions. This assumption would be violated if, for example, the fatality rate in any road region depended on the posted speed limit of a neighboring road region. While such interferences are possible, we anticipate meaningful interactions between urban roads to be rare since the factors leading to a collision are generally contained within a single road region.
We rely on the following zero-truncated, log-linear model to predict the counterfactual outcomes. Let $y_{ij}$ denote the $i^{th}$ death in the $j^{th}$ covariate strata. The joint probability distribution can be factorized:
\begin{align*}
\epsilon &\sim \text{ Normal}(0, \sigma_{\epsilon}) \\
\alpha_i^. &\sim \text{ Normal}(0, \sigma_i) \\
\bar{y}_{.j} &\sim \text{Poisson}^+( \text{exp}(\mu + \alpha^{\text{SLIM}}_1 +
\alpha^{\text{CITY}}_2 +
\alpha^{\text{YEAR}}_3 +
\alpha^{\text{COND}}_4 +
\alpha^{\text{SIGN}}_5\\
& \phantom{\sim \text{Poisson}^+( \text{exp }(\mu} +
\alpha^{\text{LGHT}}_6 +
\alpha^{\text{BLTE}}_7 +
\alpha^{\text{TFFC}}_8 +
\epsilon_j +
\beta \cdot
\text{log(} \text{EXPR}_{\cdot j})))
\end{align*}
\vspace{1mm}
To complete the specification of our first model (Model 1), we also put diffuse normal priors on $\mu$, $\beta$ and the $\sigma$. In the Comparison section we consider a second model (Model 2) where we add interactions between some of the explanatory variables.
###Computation
We sample from the posterior distribution of the parameters in the preceding model, excluding year 2015 from the dataset. Specifically, we use RStan [@RStan] to run four chains for 2,000 iterations each. This yields 4,000 posterior samples after discarding the first 1,000 iterations of each chain. In the Stan generated quantities block, we calculate the finite sample standard deviations of each qualitative variable. We also estimate the expected potential outcome for each road region in the dataset in the year 2015, first supposing the region had a 25 mph posted speed limit and then supposing it had a 30 mph posted speed limit. We use these estimates in the Criticism section to derive the posterior predictive distribution for model evaluation and in the Policy Analysis section to estimate the causal effect of New York City's default speed limit reduction. The corresponding Stan code can be found in the Appendix.
```{r stanrun, warning = FALSE, message = FALSE}
G <- 8
J <- apply(crash_data[,1:8],2,function(x) length(unique(x)))
stan_dat <- aggregate(cbind(1,EXPR)~.,crash_data,sum)
stan_dat <- stan_dat[order(stan_dat$YEAR),]
stan_dat_list <- with(stan_dat,
list(N_train = which.max(stan_dat$YEAR)-1,
N = nrow(stan_dat),
J = J,
G = G,
COND = COND,
CITY = CITY,
YEAR = YEAR,
SLIM = SLIM,
SIGN = SIGN,
LGHT = LGHT,
BLTE = BLTE,
TFFC = TFFC,
count = V1,
EXPR = EXPR))
model <- stan_model(file = 'model.stan')
fit <- sampling(model, data = stan_dat_list, iter = 2000,
chains = 4, refresh = 10, cores = 4)
```
###Analysis of Variance
Figures 3 and 4 display the Analysis of Variance using the posterior draws from the first model (Model 1) outlined above. Figure 3 shows the inner 50 and 95 percent quantiles of the finite sample standard deviation. The fat box represents the inner 50 percent range of the parameter, and the thin line represents the inner 95 percent range. We loosely interpret these regions as corresponding to the likely and possible values of the parameter respectively. Figure 3 demonstrates that time of day and lighting (LGHT) explain a significant portion of the pedestrian death rate. This makes sense as this variable captures the varying use of roads each day (i.e. commuting, tourism, dinner, the bar scene, etc.). City (CITY) and the built environment (BLTE) also explain much variation in pedestrian deaths. Surprisingly, there is little variation of deaths within cells (cell), years and levels of traffic (TFFC) suggesting that, through time of day and lighting, city and the built environment, we have adjusted for major sources of confounding.
```{r between, warning = FALSE, message = FALSE}
btwn <- c("COND_sd","CITY_sd","YEAR_sd",
"SLIM_sd","SIGN_sd","LGHT_sd",
"BLTE_sd","TFFC_sd","cell_sd")
coefs <- data.frame(extract(fit,pars=btwn))
coef_ggplot <- data.frame(coef_mean=apply(coefs,2,mean),btwn = btwn)
coef_ggplot$upper50 <- apply(coefs,2,quantile,probs=.75)
coef_ggplot$lower50 <- apply(coefs,2,quantile,probs=.25)
coef_ggplot$upper95 <- apply(coefs,2,quantile,probs=.975)
coef_ggplot$lower95 <- apply(coefs,2,quantile,probs=.025)
ggplot(coef_ggplot, aes(substr(btwn,1,4), coef_mean)) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
geom_linerange(aes(ymin = lower50, ymax = upper50),size=2) +
geom_linerange(aes(ymin = lower95, ymax = upper95)) +
coord_flip() +
labs(y=expression(hat(sigma)(.)),x="",
title="Figure 3: Model 1, Analysis of Variance, Between",
caption = paste("This figure exhibits inner 50 and 95 percent intervals for the",
"finite-population standard deviations of each batch of",
"explanatory variables. Time of day \nand lighting, city and the",
"built environment explain a substantial amount of variation in",
"the fatality rate. There is relatively more uncertainty around",
"\nthe importance of road surface condition and signs/signals.",
"Year, traffic density and speed limit account for relatively",
"little variation")) +
scale_y_continuous(limits=c(0,1.1*max(coef_ggplot$upper95)),
expand = c(0, 0))
```
The following four figures display the individual drivers of pedestrian deaths. Figures 4.1 and 4.2 show the 50 and 95% quantiles of the mean effects. Figures 4.3 and 4.4 show just the city effects and speed limit effects respectively. Figure 4.4 shows that pedestrians are, in general, at similar risk of death at 30 as compared with 25 MPH roads. Of course, this is not a comparison of the speed limit effects within the same stratum. We make this comparison in the Policy Analysis section.
```{r within, warning = FALSE, message = FALSE}
wthn <- c("COND_e","CITY_e","YEAR_e","SLIM_e",
"SIGN_e","LGHT_e","BLTE_e","TFFC_e")
coefs <- extract(fit,pars=wthn)
coef_ggplot <- data.frame(wthn = character(),
numb = character(),
coef_mean=numeric(),
upper50 = numeric(),
lower50 = numeric(),
upper95 = numeric(),
lower95 = numeric())
for(var in seq_along(coefs)){
coefs_temp <- coefs[[var]]
coef_ggplot_temp <- data.frame(wthn = wthn[var],
numb = paste0(wthn[var],1:ncol(coefs_temp)),
coef_mean=apply(coefs_temp,2,mean))
coef_ggplot_temp$upper50 <- apply(coefs_temp,2,quantile,probs=.75)
coef_ggplot_temp$lower50 <- apply(coefs_temp,2,quantile,probs=.25)
coef_ggplot_temp$upper95 <- apply(coefs_temp,2,quantile,probs=.975)
coef_ggplot_temp$lower95 <- apply(coefs_temp,2,quantile,probs=.025)
coef_ggplot <- rbind(coef_ggplot,coef_ggplot_temp)
}
ggplot(coef_ggplot[coef_ggplot$wthn != "BLTE_e",]) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
aes(numb, coef_mean,color=substr(wthn,1,4)) +
geom_linerange(aes(ymin = lower50, ymax = upper50),size=2) +
geom_linerange(aes(ymin = lower95, ymax = upper95)) +
coord_flip() +
scale_x_discrete(breaks = NULL) +
theme(legend.position = c(.85, 0.3)) +
scale_color_discrete(guide = guide_legend(reverse = TRUE)) +
labs(title="Figure 4.1: Model 1, Analysis of Variance, Within",
y=expression(hat(mu)(.)),x="",
color = "",
caption = paste("This figure exhibits inner 50 and 95 percent intervals for the",
"effects from seven batches of explanatory variables. Effects",
"are colored by batch and \nare interpreted as the log of the",
"expected multiplicative increase in the fatality rate compared",
"to the average road region. Time of day and \nlighting and",
"city explain a substantial amount of variation in the fatality",
"rate."))
ggplot(coef_ggplot[coef_ggplot$wthn =="BLTE_e",]) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
aes(numb, coef_mean) +
geom_linerange(aes(ymin = lower50, ymax = upper50),size=2, alpha = .5) +
geom_linerange(aes(ymin = lower95, ymax = upper95), alpha = .5) +
coord_flip() +
scale_x_discrete(breaks = NULL) +
theme(legend.position = "none") +
labs(title="Figure 4.2: Model 1, Analysis of Variance, Within",
y=expression(hat(mu)("BLTE")),x="",
color = "",
caption = paste("This figure exhibits inner 50 and 95 percent intervals for the",
"batch of built environment effects. Coefficients are",
"interpreted as the log of the expected \nmultiplicative",
"increase in the fatality rate compared to the average road",
"region."))
city_anova <- data.frame(coef_ggplot[coef_ggplot$wthn == "CITY_e",],
city_labels = c("Washington D.C.","Boston","Austin",
"Denver","Portland","Chicago",
"Seattle","Los Angeles","San Diego",
"San Francisco","San Jose",
"New York City"))
city_anova$city_labels <- factor(city_anova$city_labels,levels=levels(city_labels))
ggplot(city_anova) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
aes(city_labels, lower50 + (upper50 - lower50)/2) +
geom_linerange(aes(ymin = lower50, ymax = upper50),size=7) +
geom_linerange(aes(ymin = lower95, ymax = upper95)) +
geom_text(aes(label = city_labels),color="white",size=2) +
coord_flip() +
scale_x_discrete(breaks = NULL) +
theme(legend.position = "none") +
labs(title="Figure 4.3: Model 1, Analysis of Variance, Within",
y=expression(hat(mu)("CITY")),x="",
color = "",
caption = paste("This figure exhibits inner 50 and 95 percent intervals for the",
"batch of city effects. Coefficients are interpreted as the log",
"of the expected multiplicative \nincrease in the fatality rate",
"compared to the average road region. New York City and Los",
"Angeles have roughly twice the average fatality rate while",
"\nWashington D.C. has roughly one third, holding all else",
"equal."))
ggplot(data.frame(coef_ggplot[coef_ggplot$wthn == "SLIM_e",][5:9,],
slim_names = c("20 MPH","25 MPH","30 MPH",
"35 MPH","40 MPH"))) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
aes(numb, lower50 + (upper50 - lower50)/2) +
geom_linerange(aes(ymin = lower50, ymax = upper50),size=7) +
geom_linerange(aes(ymin = lower95, ymax = upper95)) +
geom_text(aes(label = slim_names),color="white",size=3) +
coord_flip() +
scale_x_discrete(breaks = NULL) +
theme(legend.position = "none") +
labs(title="Figure 4.4: Model 1, Analysis of Variance, Within",
y=expression(hat(mu)("SLIM")),x="",
color = "",
caption = paste("This figure exhibits inner 50 and 95 percent intervals for",
"some speed limit effects. Effects are interpreted as the log",
"of the expected multiplicative \nincrease in the fatality rate",
"compared to the average road region. With this model,",
"individual speed limit effects are not discernable."))
```
###Comparison
Figure 3 suggests dropping variables YEAR and TFF and adding the following interactions to the regression: CITYxSLIM, LGHTxSLIM, BLTExSLIM, CONDxSLIM, LGHTxCOND and LGHTxCONDxSLIM. Figure 5 shows the resulting ANOVA plot using the posterior draws from this second model (Model 2) after repeating the same steps outlined in the Computation section. The within cell error term is much smaller than in Figure 3, suggesting that the new covariates jointly explain more variation than the original model. In addition, Figure 6.1, unlike Figure 4.5, shows a slightly higher average fatality rate at 30 than 25 MPH roads. Such a difference is small but potentially meaningful.
```{r stanrun_int, warning = FALSE, message = FALSE,eval=FALSE}
crash_data_int <- crash_data
crash_data_int <- rbind(crash_data_int,crash_data_int[crash_data_int$YEAR==6,])
crash_data_int$SLIM_actual <- crash_data_int$SLIM
crash_data_int$SLIM[crash_data_int$YEAR==6] <-
sort(rep(c(6,7),sum(crash_data_int$YEAR == 6)/2))
for(int in c("CITYxSLIM","LGHTxSLIM","BLTExSLIM","CONDxSLIM","LGHTxCOND"))
crash_data_int[int] <- as.numeric(factor(interaction(
eval(parse(text=paste0("crash_data_int$",strsplit(int,"x")[[1]])[[1]])),
eval(parse(text=paste0("crash_data_int$",strsplit(int,"x")[[1]])[[2]]))),
labels = 1:length(unique(interaction(
eval(parse(text=paste0("crash_data_int$",strsplit(int,"x")[[1]])[[1]])),
eval(parse(text=paste0("crash_data_int$",strsplit(int,"x")[[1]])[[2]])))))
))
crash_data_int$LGHTxCONDxSLIM <- as.numeric(
factor(with(crash_data_int,interaction(SLIM,LGHT,COND)),
labels = 1:length(unique(with(crash_data_int,interaction(SLIM,LGHT,COND)))))
)
crash_data_int$TEST <- crash_data_int$YEAR == 6
crash_data_int <- crash_data_int[,c(1:2,4:7,9:18)]
G <- 12
J <- apply(crash_data_int[, c(1:6,10:15)], 2, function(x) length(unique(x)))
stan_dat_int <- aggregate(cbind(1,EXPR)~., crash_data_int, sum)
stan_dat_list <- with(stan_dat_int,
list(N_train = which.max(stan_dat_int$TEST)-1,
N = nrow(stan_dat_int),
J = J,
G = G,
COND = COND,
CITY = CITY,
SLIM = SLIM,
SIGN = SIGN,
LGHT = LGHT,
BLTE = BLTE,
CITYxSLIM = CITYxSLIM,
LGHTxSLIM = LGHTxSLIM,
BLTExSLIM = BLTExSLIM,
CONDxSLIM = CONDxSLIM,
LGHTxCOND = LGHTxCOND,
LGHTxCONDxSLIM = LGHTxCONDxSLIM,
count = V1,
EXPR = EXPR))
model_int <- stan_model(file = 'model_int.stan')
fit_int <- sampling(model_int, data = stan_dat_list, iter = 2000,
chains = 4, refresh = 10, cores = 4,
control = list(adapt_delta = 0.99))
```
```{r between_int, warning = FALSE, message = FALSE}
btwn <- c("COND_sd","CITY_sd","SLIM_sd","SIGN_sd","LGHT_sd","BLTE_sd",
"cell_sd","CITYxSLIM_sd", "LGHTxSLIM_sd","BLTExSLIM_sd",
"CONDxSLIM_sd","LGHTxCOND_sd","LGHTxCONDxSLIM_sd")
coefs <- data.frame(extract(fit_int,pars=btwn))
coef_ggplot <- data.frame(coef_mean=apply(coefs,2,mean),btwn = btwn)
coef_ggplot$upper50 <- apply(coefs,2,quantile,probs=.75)
coef_ggplot$lower50 <- apply(coefs,2,quantile,probs=.25)
coef_ggplot$upper95 <- apply(coefs,2,quantile,probs=.975)
coef_ggplot$lower95 <- apply(coefs,2,quantile,probs=.025)
coef_ggplot$btwn <- substr(coef_ggplot$btwn,1,nchar(as.character(coef_ggplot$btwn))-3)
ggplot(coef_ggplot, aes(btwn, coef_mean)) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
geom_linerange(aes(ymin = lower50, ymax = upper50),size=2) +
geom_linerange(aes(ymin = lower95, ymax = upper95)) +
coord_flip() +
scale_y_continuous(limits=c(0,1.1*max(coef_ggplot$upper95)),
expand = c(0, 0)) +
labs(y=expression(hat(sigma)(.)),x="",
title="Figure 5: Model 2, Analysis of Variance, Between",
caption = paste("This figure exhibits inner 50 and 95 percent intervals",
"for the finite-population standard deviations of each",
"batch of explanatory variables. \nCompared to Model 1,",
"Time of day and lighting and city remain strong",
"predictors of the fatality rate. Weather and surface",
"condition has \nincreased in importance. In addition,",
"interactions between the weather and surface condition",
"and speed limit and the built environment \nand speed",
"limit appear moderately predictive."))
wthn <- c("SLIM_e","COND_e","BLTExSLIM_e","CONDxSLIM_e")
coefs <- extract(fit_int,pars=wthn)
coef_ggplot <- data.frame(wthn = character(),
numb = character(),
coef_mean=numeric(),
upper50 = numeric(),
lower50 = numeric(),
upper95 = numeric(),
lower95 = numeric())
for(var in seq_along(coefs)){
coefs_temp <- coefs[[var]]
coef_ggplot_temp <- data.frame(wthn = wthn[var],
numb = paste0(wthn[var],1:ncol(coefs_temp)),
coef_mean=apply(coefs_temp,2,mean))
coef_ggplot_temp$upper50 <- apply(coefs_temp,2,quantile,probs=.75)
coef_ggplot_temp$lower50 <- apply(coefs_temp,2,quantile,probs=.25)
coef_ggplot_temp$upper95 <- apply(coefs_temp,2,quantile,probs=.975)
coef_ggplot_temp$lower95 <- apply(coefs_temp,2,quantile,probs=.025)
coef_ggplot <- rbind(coef_ggplot,coef_ggplot_temp)
}
ggplot(data.frame(coef_ggplot[coef_ggplot$wthn == "SLIM_e",][5:9,],
slim_names = c("20 MPH","25 MPH","30 MPH",
"35 MPH","40 MPH"))) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
aes(numb, lower50 + (upper50 - lower50)/2) +
geom_linerange(aes(ymin = lower50, ymax = upper50),size=7) +
geom_linerange(aes(ymin = lower95, ymax = upper95)) +
geom_text(aes(label = slim_names),color="white",size=3) +
coord_flip() +
scale_x_discrete(breaks = NULL) +
theme(legend.position = "none") +
labs(title="Figure 6.1: Model 2, Analysis of Variance, Within",
y=expression(hat(mu)("SLIM")),x="",
color = "",
caption = paste("This figure exhibits inner 50 and 95 percent intervals for",
"some speed limit effects. Effects are interpreted as the log",
"of the expected multiplicative \nincrease in the fatality rate",
"compared to the average road region. With this model, slight",
"average speed limit effects are observable although \nthese",
"differences are likely too small to be meaningful."))
```
While the average speed limit effect is small, Figures 6.3 and 6.4 display some meaningful interactions. For example, 25 mph roads regions appear safer than 30 mph road regions during cloudy weather and at large, four lane intersections.
```{r within_int, warning = FALSE, message = FALSE}
ggplot(data.frame(coef_ggplot[coef_ggplot$wthn == "COND_e",][c(2,3,9,15,18,23,24),],
cond_names = c("Clear/Dry", #weather is clear/road is dry
"Clear/Wet", #weather is clear/road is wet
"Rain/Wet", #weather is rain/road is wet
"Snow/Slush", #weather is snow/road is slush
"Fog/Dry", #weather is fog/road is dry
"Cloudy/Dry", #weather is cloudy/road is dry
"Cloudy/Wet" #weather is cloudy/road is wet
))) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
aes(cond_names, lower50 + (upper50 - lower50)/2) +
geom_linerange(aes(ymin = lower50, ymax = upper50),size=7) +
geom_linerange(aes(ymin = lower95, ymax = upper95)) +
geom_text(aes(label = cond_names),color="white",size=3) +
coord_flip() +
scale_x_discrete(breaks = NULL) +
theme(legend.position = "none") +
labs(title="Figure 6.2: Model 2, Analysis of Variance, Within",
y=expression(hat(mu)("COND")),x="",
color = "",
caption = paste("This figure exhibits inner 50 and 95 percent intervals for",
"some weather and surface condition effects. Effects are",
"interpreted as the log of the expected \nmultiplicative",
"increase in the fatality rate compared to the average road",
"region. Clear weather and dry roads have significantly larger",
"fatality rates \npossibly reflecting the fact that more",
"pedestrians are on roads and interacting with vehicles.",
"Rainstorms also appear to create hazardous \nconditions that",
"put pedestrians at increased risk."))
ggplot(data.frame(coef_ggplot[coef_ggplot$wthn == "CONDxSLIM_e",][
c(12,15,24,25,28,35,45,46),],
cond_names = c("Clear/Dry",
"Rain/Wet",
"Cloudy/Dry",
"Cloudy/Wet",
"Clear/Dry",
"Rain/Wet",
"Cloudy/Dry",
"Cloudy/Wet"
),
slim = c(rep("25 MPH",4),rep("30 MPH",4)))) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
aes(paste0(cond_names,slim), lower50 + (upper50 - lower50)/2, color = slim) +
geom_linerange(aes(ymin = lower50, ymax = upper50),size=7) +
geom_linerange(aes(ymin = lower95, ymax = upper95)) +
geom_text(aes(label = cond_names),color="white",size=3) +
coord_flip() +
scale_x_discrete(breaks = NULL) +
theme(legend.position = c(.8, 0.8)) +
scale_color_discrete(guide = guide_legend(reverse = TRUE)) +
labs(title="Figure 6.3: Model 2, Analysis of Variance, Within",
y=expression(hat(mu)("CONDxSLIM")),x="",
color = "",
caption = paste("This figure exhibits inner 50 and 95 percent intervals for",
"some interaction effects between the posted speed limit and",
"the weather and surface \ncondition. Effects are interpreted",
"as the log of the expected multiplicative increase in the",
"fatality rate compared to the average road region. \nLower",
"speed limits coincide with lower fatality rates during cloudy",
"weather possibly because pedestrians remain on the road but",
"driver \nvision is relatively obstructed. There does not",
"appear to be much difference between speed limits during",
"clear weather."))
ggplot(data.frame(coef_ggplot[coef_ggplot$wthn == "BLTExSLIM_e",][
c(26,28,57,59,105,107,138,140),],
cond_names = c("Two Lane Road",
"Two Lane Intersec",
"Four Lane Road",
"Four Lane Intersec",
"Two Lane Road",
"Two Lane Intersec",
"Four Lane Road",
"Four Lane Intersec"
),
slim = c(rep("25 MPH",4),rep("30 MPH",4)))) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
aes(paste0(cond_names,slim), lower50 + (upper50 - lower50)/2, color = slim) +
geom_linerange(aes(ymin = lower50, ymax = upper50),size=7) +
geom_linerange(aes(ymin = lower95, ymax = upper95)) +
geom_text(aes(label = cond_names),color="white",size=3) +
coord_flip() +
scale_x_discrete(breaks = NULL) +
theme(legend.position = c(.1, 0.8)) +
scale_color_discrete(guide = guide_legend(reverse = TRUE)) +
labs(title="Figure 6.4: Model 2, Analysis of Variance, Within",
y=expression(hat(mu)("BLTExSLIM")),x="",
color = "",
caption = paste("This figure exhibits inner 50 and 95 percent intervals for",
"some interaction effects between the posted speed limit and",
"the built environment. Effects \nare interpreted as the log",
"of the expected multiplicative increase in the fatality rate",
"compared to the average road region. Lower speed limits",
"\nappear to coincide with lower fatality rates across the",
"major road types with the largest decreases in intersections."))
```
###Criticism
We can check both of the preceding models by comparing the posterior predictions for the number of deaths on the 2015 sample roads with the actual number of deaths observed on those roads [@gelman2014bayesian]. This ensures the models do not over predict or under predict the number of deaths in each stratum. We use a p-value, the percentage of samples from the predictive distribution exceeding the observed number of deaths, to indicate the suitability of the models for prediction. Figures 7.1 and 7.2 demonstrate that the original model without interactions coincides with the observed data for 30 mph road regions but under predicts the observed data for 25 mph road regions. Figure 7.3 shows that the second model with interactions better fits the observed data. We will use this model in the Policy Analysis section.
```{r check, warning = FALSE, message = FALSE}
rpois_trunc <- function(log_rate){
rate <- exp(log_rate)
cum_cdf <- exp(-rate) / (1 - exp(-rate)) * rate
x <- 1; u <- runif(1)
while (cum_cdf < u) {
x <- x + 1;
cum_cdf <- cum_cdf * (1 + rate / x);
}
x
}
pred30 <- apply(extract(fit,c("mu_indiv_pred30"))[[1]], c(1,2), rpois_trunc)
city_30 <- apply(pred30[,stan_dat$SLIM[stan_dat$YEAR == 6] == 7],1,sum)
pvalue_30 <- paste("p-value:",round(sum(city_30 > sum(stan_dat$V1[
stan_dat$SLIM == 7 & stan_dat$YEAR == 6]))/length(city_30),2))
qplot(city_30) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
geom_vline(xintercept = sum(stan_dat$V1[stan_dat$SLIM == 7 &
stan_dat$YEAR == 6]),
linetype=2) +
geom_text(aes(label=pvalue_30),x=80,y=750) +
scale_x_continuous(expand = c(0,0)) +
scale_y_continuous(expand = c(0,50), limits = c(0,NA)) +
labs(title=
"Figure 7.1: Model 1, Posterior Draws of Deaths in 2015 on 30 mph roads (unweighted)",
x = "", y = "",
caption = paste("This figure exhibits a histogram of draws from the posterior",
"predictive distribution of the total number of fatalities on the",
"30 mph road regions in \n2015 set aside as a test set. The dotted",
"line represents the observed number of fatalities on those road",
"regions. The p-value is the proportion \nof simulations exceeding",
"the number observed in 2015. The purpose of this figure is to",
"ensure that the model is predictive of the fatality rate \nof the",
"roads that did not have their posted speed limits reduced to 25",
"mph. Model 1 appears consistent with the total number of",
"fatalities \non these road regions."))
pred25 <- apply(extract(fit,c("mu_indiv_pred25"))[[1]], c(1,2), rpois_trunc)
city_25 <- apply(pred25[,stan_dat$SLIM[stan_dat$YEAR == 6] == 6],1,sum)
pvalue_25 <- paste("p-value:",round(sum(city_25 > sum(stan_dat$V1[
stan_dat$SLIM == 6 & stan_dat$YEAR == 6]))/length(city_25),2))
qplot(city_25) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
geom_vline(xintercept = sum(stan_dat$V1[stan_dat$SLIM == 6 &
stan_dat$YEAR == 6]),
linetype=2) +
geom_text(aes(label=pvalue_25,x=200,y=500)) +
scale_x_continuous(expand = c(0,0)) +
scale_y_continuous(expand = c(0,50), limits = c(0,NA)) +
labs(title=
"Figure 7.2: Model 1, Posterior Draws of Deaths in 2015 on 25 mph roads (unweighted)",
x = "", y = "",
caption = paste("This figure exhibits a histogram of draws from the posterior",
"predictive distribution of the total number of fatalities on the",
"25 mph road regions in \n2015 set aside as a test set. The dotted",
"line represents the observed number of fatalities on those road",
"regions. The p-value is the proportion \nof simulations exceeding",
"the number observed in 2015. The purpose of this figure is to",
"ensure that the model is predictive of the fatality rate \nof the",
"roads that had their posted speed limits reduced to 25 mph.",
"Model 1 appears to systematically under predict the total number",
"of\n fatalities on these road regions."))
pred_int <- apply(extract(fit_int,c("mu_indiv_pred"))[[1]], c(1,2), rpois_trunc)
city_25_int <- apply(pred_int[,stan_dat_int$SLIM_actual[stan_dat_int$TEST == TRUE] == 6 &
stan_dat_int$SLIM[stan_dat_int$TEST == TRUE] == 6],1,sum)
pvalue_25_int <- paste("p-value:",round(sum(city_25_int > sum(stan_dat_int$V1[
stan_dat_int$SLIM_actual == 6 & stan_dat_int$TEST == TRUE &
stan_dat_int$SLIM == 6]))/length(city_25_int),2))
qplot(city_25_int) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
geom_vline(xintercept = sum(stan_dat_int$V1[stan_dat_int$SLIM_actual == 6 &
stan_dat_int$TEST == TRUE &
stan_dat_int$SLIM == 6]),
linetype=2) +
geom_text(aes(label=pvalue_25_int,x=200,y=500)) +
scale_x_continuous(expand = c(0,0)) +
scale_y_continuous(expand = c(0,50), limits = c(0,NA)) +
labs(title=
"Figure 7.3: Model 2, Posterior Draws of Deaths in 2015 on 25 mph roads (unweighted)",
x = "", y = "",
caption = paste("This figure exhibits a histogram of draws from the posterior",
"predictive distribution of the total number of fatalities on the",
"25 mph road regions in \n2015 set aside as a test set. The dotted",
"line represents the observed number of fatalities on those road",
"regions. The p-value is the proportion \nof simulations exceeding",
"the number observed in 2015. The purpose of this figure is to",
"ensure that the model is predictive of the fatality rate \nof the",
"roads that had their posted speed limits reduced to 25 mph. Model",
"2 appears consistent with the total number of fatalities on",
"these \nroad regions."))
```
##III. Policy Analysis
A naïve before and after study would compare the number of pedestrian deaths in New York City in 2015 with the number of pedestrian deaths in New York City in 2014 or the 2010-2014 average. These comparisons are extremely popular among policy analysts because they are simple to calculate and simple to interpret: the same units are compared before and after treatment. The use of before-after analysis in transportation research is discussed at length in @hauer2005cause and @hauer2015art.
The number of pedestrian deaths in New York City each year is displayed in Figure 8.1, and a before-after comparison would estimate four or six lives saved respectively. This estimate, however, fails to account for possible factors that influence the number of pedestrian deaths and that changed in tandem with speed limits. As a result, we cannot rule out the influence of additional policy changes or other external factors.
```{r before_after, warning = FALSE, message = FALSE}
ggplot(aggregate(V1 ~ YEAR,
stan_dat[stan_dat$CITY==12,],sum)) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
aes(YEAR+2009,V1) +
geom_line() +
scale_x_continuous(limits=c(2009.5,2015.5),
expand = c(0, 0)) +
scale_y_continuous(limits=c(0,200),expand = c(0,0)) +
labs(x = "", y = "",
title="Figure 8.1: Number of Pedestrian Deaths in NYC",
caption = paste("This figure exhibits the annual number of pedestrian deaths",
"in New York City over the study period. New York City had",
"roughly 140 pedestrian \ndeaths each year except for 2013 when",
"there were 168. This reflects a 20 percent increase. City",
"officials have attributed the fatality decline \nfrom 2013 to",
"2015 to Vision Zero policy."))
```
For example, the predictive importance of LGHT and COND that we identified in the ANOVA plots suggest that fatalities are largely driven by changes in weather, lighting and road surface conditions. Figure 8.2 shows the number of rainstorms in New York City relative to 2010 follows a similar trend to the fatality rate relative to 2010. If we interpret Figure 6.2 to suggest that rain storms increase the number of pedestrian deaths than Figure 8.2 implies that the number of pedestrian fatalities in New York City have actually increased after adjusting for the number of rain storms. Of course this is one of several plausible explanations for the trend of pedestrian fatalities. The purpose of Figure 8.2 is to simply motivate the necessity of accounting for major sources of confounding.
```{r explanation, warning = FALSE, message = FALSE}
yearly_rainstorm <- function(x) {
url <- paste0("https://www.wunderground.com/history/airport/jfk/",
x,"/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=",
x,"&req_city=NA&req_state=NA&req_statename=NA&format=1")
weather <- read.csv(url)
sum(grepl("Rain",weather$Events) & weather$Mean.Wind.SpeedMPH > 11.20402)
}
deaths <- aggregate(V1 ~ YEAR,stan_dat[stan_dat$CITY==12,],sum)
deaths$deaths_norm <- deaths$V1/deaths$V1[1]
deaths$rain_storm <- sapply(2010:2015,yearly_rainstorm)
deaths$rain_storm_norm <- deaths$rain_storm/deaths$rain_storm[1]
ggplot(deaths) +
theme(plot.title = element_text(size=9, hjust = .5),
plot.caption = element_text(hjust = 1)) +
geom_line(aes(YEAR+2009,deaths_norm)) +
geom_line(aes(YEAR+2009,rain_storm_norm), linetype=2) +
geom_text(x = 2014, y = 1.25, label = "Rain Storms") +
geom_text(x = 2014, y = .9, label = "Fatalities") +
scale_x_continuous(limits=c(2009.5,2015.5),
expand = c(0, 0)) +
scale_y_continuous(limits=c(0,1.5),expand = c(0,0)) +
labs(x = "", y = "",
title="Figure 8.2: A Plausible Explanation of Pedestrian Deaths in NYC",
caption = paste("This figure exhibits the annual number of pedestrian deaths",
"in New York City over the study period divided by the number",
"of deaths in 2010 \n(solid line) and the annual number of",
"rainstorms in New York City over the study period divided by",
"the number of rainstorms in 2010 \n(dashed line). While Figure",
"8.1 suggested that fatalities have fallen relative to 2013,",
"Figure 8.2 suggests that, relative to the \nnumber of rain",
"storms, fatalities have actually increased in 2015."))
```
We conclude by calculating the effect of New York City's speed limit reduction using the posterior predictions from both models in the previous section. Ignoring the sampling weights, the estimate corresponds to the set of road regions in New York City for which there was a death in 2015. Using the sampling weights, the estimate corresponds to all road regions in New York City. Table 3 shows that under Model 2, the most frequent outcome after lowering the speed limit is coincidentally similar to the before-after estimate of four or six.
```{r policy_evaluation, warning = FALSE, message = FALSE}
nyc_pred25 <- pred25[,stan_dat$SLIM[stan_dat$YEAR==6] == 6 &
stan_dat$CITY[stan_dat$YEAR==6] == 12]
nyc_pred30 <- pred30[,stan_dat$SLIM[stan_dat$YEAR==6] == 6 &
stan_dat$CITY[stan_dat$YEAR==6] == 12]
nyc_pred25_int <- pred_int[,stan_dat_int$SLIM_actual[stan_dat_int$TEST==TRUE] == 6 &
stan_dat_int$SLIM[stan_dat_int$TEST==TRUE] == 6 &
stan_dat_int$CITY[stan_dat_int$TEST==TRUE] == 12]
nyc_pred30_int <- pred_int[,stan_dat_int$SLIM_actual[stan_dat_int$TEST==TRUE] == 6 &
stan_dat_int$SLIM[stan_dat_int$TEST==TRUE] == 7 &
stan_dat_int$CITY[stan_dat_int$TEST==TRUE] == 12]
WGHT <- stan_dat$WGHT[stan_dat$SLIM == 6 &
stan_dat$CITY == 12 &
stan_dat$YEAR == 6]
WGHT_int <- stan_dat_int$WGHT[stan_dat_int$SLIM == 6 &
stan_dat_int$SLIM_actual == 6 &
stan_dat_int$CITY == 12 &
stan_dat_int$TEST == TRUE]
results <- rbind(
round(quantile(apply( nyc_pred25 - nyc_pred30, 1, sum),
c(.025,.25,.5,.75,.975)),2),
round(quantile(apply( nyc_pred25 - nyc_pred30, 1,
function(x,w) sum(x*w), w = WGHT),
c(.025,.25,.5,.75,.975)),2),
round(quantile(apply(nyc_pred25_int - nyc_pred30_int, 1, sum),
c(.025,.25,.5,.75,.975)),2),
round(quantile(apply(nyc_pred25_int - nyc_pred30_int, 1,
function(x,w) sum(x*w), w = WGHT_int),
c(.025,.25,.5,.75,.975)),2)
)
rownames(results) <- c("Model 1", "Model 1 (weighted)","Model 2","Model 2 (weighted)")
kable(results, caption="Estimated Number of Deaths due to Lowering Speed Limit
from 30 to 25 MPH on 25 MPH roads in NYC")
```
The estimated probability of witnessing an increase in deaths after the policy change is `r round(sum(apply(nyc_pred25_int - nyc_pred30_int, 1, function(x,w) sum(x*w), w = WGHT_int)> 0)/4000,2)` percent. We conclude that while the policy appears beneficial, there is too much uncertainty in the parameter estimates and thus too much variation in the predictive distribution, for changes in the number of pedestrian deaths each year to be meaningfully attributed to policy.
##IV. Appendix
###Data Sources
Our primary dataset is the Fatality Analysis Reporting System (FARS) collected by the National Highway Traffic Safety Administration. FARS contains a census of vehicle related deaths in the United States with a large number of covariates detailing the road, vehicle and persons involved. Road refers to the general area in which the death took place. This is generally a road segment or an intersection.
We supplement the [FARS](ftp://ftp.nhtsa.dot.gov/fars/) data with a few other sources. These sources are not included in the References section but their websites have been linked to here. We obtain sampling weights for each FARS observation using the [GES](ftp://ftp.nhtsa.dot.gov/GES). Missing New York City speed limit data in FARS is obtained from [NYC DOT](http://www.nyc.gov/html/dot/html/about/datafeeds.shtml) using the shapefiles of speed limits. The pedestrian population on each road was calculated following the
[Census Bureau](https://www.census.gov/hhes/commuting/data/calculations.html) using the [TPP](https://www.fhwa.dot.gov/planning/census_issues/ctpp/data_products/2006-2010_tract_flows/) and [PDB](http://www.census.gov/research/data/planning_database/). Average annual daily traffic was estimated from [HPMS](http://www.fhwa.dot.gov/policyinformation/hpms/shapefiles.cfm). The TFFC variable is discretized by the average number of vehciles per second.
###Stan Code
```{stan, output.var="model1"}
//Model 1
data {
int<lower=1> N_train;// number of training data points
int<lower=1> N; // number of total data points
int<lower=1> G; // number of groupings
int<lower=3> J[G]; // group sizes
int COND[N]; // index for surface condition/inclement weather
int CITY[N]; // index for city
int YEAR[N]; // index for year
int SLIM[N]; // index for posted speed limit
int SIGN[N]; // index for signs and signals i.e. school zone/work zone
int LGHT[N]; // index for light and time
int BLTE[N]; // index for built environment
int TFFC[N]; // index for traffic volume
vector[N] EXPR; // population exposed
int count[N]; // number of pedestrian deaths
}
transformed data {
vector[N] offset;
offset = log(EXPR);
}
parameters {
real offset_e;
vector[J[1]] COND_eta;
vector[J[2]] CITY_eta;
vector[J[3]] YEAR_eta;
vector[J[4]] SLIM_eta;
vector[J[5]] SIGN_eta;
vector[J[6]] LGHT_eta;
vector[J[7]] BLTE_eta;
vector[J[8]] TFFC_eta;
vector<lower=0>[G + 1] sds;
vector[N_train] cell_eta;
real mu;
}
transformed parameters {
vector[J[1]] COND_e;
vector[J[2]] CITY_e;
vector[J[3]] YEAR_e;
vector[J[4]] SLIM_e;
vector[J[5]] SIGN_e;
vector[J[6]] LGHT_e;
vector[J[7]] BLTE_e;
vector[J[8]] TFFC_e;
vector[N_train] cell_e;
vector[N_train] mu_indiv;
COND_e = sds[1] * COND_eta;
CITY_e = sds[2] * CITY_eta;
YEAR_e = sds[3] * YEAR_eta;
SLIM_e = sds[4] * SLIM_eta;
SIGN_e = sds[5] * SIGN_eta;
LGHT_e = sds[6] * LGHT_eta;
BLTE_e = sds[7] * BLTE_eta;
TFFC_e = sds[8] * TFFC_eta;
cell_e = sds[G + 1] * cell_eta;
for(n in 1:N_train)
mu_indiv[n] = mu + offset_e * offset[n]
+ COND_e[COND[n]]
+ CITY_e[CITY[n]]
+ YEAR_e[YEAR[n]]
+ SLIM_e[SLIM[n]]
+ SIGN_e[SIGN[n]]
+ LGHT_e[LGHT[n]]
+ BLTE_e[BLTE[n]]
+ TFFC_e[TFFC[n]]
+ cell_e[n];
}
model {
COND_eta ~ normal(0,1);
CITY_eta ~ normal(0,1);
YEAR_eta ~ normal(0,1);
SLIM_eta ~ normal(0,1);
SIGN_eta ~ normal(0,1);
LGHT_eta ~ normal(0,1);
BLTE_eta ~ normal(0,1);
TFFC_eta ~ normal(0,1);
cell_eta ~ normal(0,1);
offset_e ~ normal(0,1);
sds ~ normal(0, 1);
mu ~ normal(0, 10);
for(n in 1:N_train){
target += poisson_log_lpmf(count[n] | mu_indiv[n]);
target += -log1m_exp(-exp(mu_indiv[n]));
}
}
generated quantities {
real COND_sd;
real CITY_sd;
real YEAR_sd;
real SLIM_sd;
real SIGN_sd;
real LGHT_sd;
real BLTE_sd;
real TFFC_sd;
real cell_sd;
vector[N - N_train] mu_indiv_pred25;
vector[N - N_train] mu_indiv_pred30;
vector[N - N_train] cell_e_pred;
COND_sd = sd(COND_e);
CITY_sd = sd(CITY_e);
YEAR_sd = sd(YEAR_e);
SLIM_sd = sd(SLIM_e);
SIGN_sd = sd(SIGN_e);
LGHT_sd = sd(LGHT_e);
BLTE_sd = sd(BLTE_e);
TFFC_sd = sd(TFFC_e);
cell_sd = sd(cell_e);
for (n in 1:(N - N_train)){
cell_e_pred[n] = normal_rng(0, sds[G+1]);
mu_indiv_pred25[n] = mu + offset_e * offset[N_train + n]
+ COND_e[COND[N_train + n]]
+ CITY_e[CITY[N_train + n]]
+ YEAR_e[YEAR[N_train + n]]
+ SLIM_e[6]
+ SIGN_e[SIGN[N_train + n]]
+ LGHT_e[LGHT[N_train + n]]
+ BLTE_e[BLTE[N_train + n]]
+ TFFC_e[TFFC[N_train + n]]
+ cell_e_pred[n];
mu_indiv_pred30[n] = mu + offset_e * offset[N_train + n]
+ COND_e[COND[N_train + n]]
+ CITY_e[CITY[N_train + n]]
+ YEAR_e[YEAR[N_train + n]]
+ SLIM_e[7]
+ SIGN_e[SIGN[N_train + n]]
+ LGHT_e[LGHT[N_train + n]]
+ BLTE_e[BLTE[N_train + n]]
+ TFFC_e[TFFC[N_train + n]]
+ cell_e_pred[n];
}
}
```
```{stan, output.var="model2"}
//Model 2
data {
int<lower=1> N_train;// number of training data points
int<lower=1> N; // number of total data points
int<lower=1> G; // number of groupings
int<lower=3> J[G]; // group sizes
int COND[N]; // index for surface condition/inclement weather
int CITY[N]; // index for city
int SLIM[N]; // index for posted speed limit
int SIGN[N]; // index for signs and signals i.e. school zone/work zone
int LGHT[N]; // index for light and time
int BLTE[N]; // index for built environment
int CITYxSLIM[N];
int LGHTxSLIM[N];
int BLTExSLIM[N];
int CONDxSLIM[N];
int LGHTxCOND[N];
int LGHTxCONDxSLIM[N];
vector[N] EXPR; // population exposed
int count[N]; // number of pedestrian deaths
}
transformed data {
vector[N] offset;
offset = log(EXPR);
}
parameters {
real offset_e;
vector[J[1]] COND_eta;
vector[J[2]] CITY_eta;
vector[J[3]] SLIM_eta;
vector[J[4]] SIGN_eta;
vector[J[5]] LGHT_eta;
vector[J[6]] BLTE_eta;
vector[J[7]] CITYxSLIM_eta;
vector[J[8]] LGHTxSLIM_eta;
vector[J[9]] BLTExSLIM_eta;
vector[J[10]] CONDxSLIM_eta;
vector[J[11]] LGHTxCOND_eta;
vector[J[12]] LGHTxCONDxSLIM_eta;
vector<lower=0>[G + 1] sds;
vector[N_train] cell_eta;
real mu;
}
transformed parameters {
vector[J[1]] COND_e;
vector[J[2]] CITY_e;