-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
233 lines (183 loc) · 6.83 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
---
output:
md_document:
variant: markdown_github
---
```{r, message = FALSE}
library(nzcrash)
library(dplyr)
library(tidyr)
library(magrittr)
library(stringr)
library(ggplot2)
library(scales)
library(lubridate)
```
# nzcrash
This package redistributes [crash
statistics](http://www.nzta.govt.nz/resources/crash-analysis-system-data/)
already available from the New Zealand Transport Agency, but in a more
convenient form.
It's a large package (over 20 megabytes, compressed).
## Datasets
The `crashes` dataset describes most facts about a crash. The datasets `causes`,
`vehicles`, and `objects_struck` describe facts that are in a many-to-one
relationship with crashes. They can be joined to the `crashes` dataset by the
common `id` column. The `causes` dataset can additionally be joined to the
`vehicles` dataset by the combination of the `id` and `vehicle_id` columns.
This is most useful when the resulting table is also joined to the `crashes`
dataset.
## Up-to-date-ness
The data was last scraped from the NZTA website on `r Sys.Date()`. At
that time, the NZTA had published data up to `r max(crashes$date)`.
```{r}
dim(crashes)
dim(causes)
dim(vehicles)
dim(objects_struck)
```
## Accuracy
The [NZTA](http://www.transport.govt.nz/research/roadtoll/#5), doesn't agree with [itself](http://www.transport.govt.nz/research/roadtoll/annualroadtollhistoricalinformation/) about recent annual road tolls, and this dataset gives a third opinion.
```{r}
crashes %>%
filter(severity == "fatal") %>%
group_by(year = year(date)) %>%
summarize(fatalities = sum(fatalities))
```
## Severity
Crashes categorised as "fatal", "serious", "minor" or "non-injury", based on the
casualties. If there are any fatalities, then the crash is a "fatal" crash,
otherwise if there are any 'severe' injuries, the crash is a "serious" crash.
The definition of a 'severe' injury is not clear.
Minor and non-injury crashes are likely to be under-recorded since they often do
not involve the police, who write most of the crash reports upon which these
datasets are based.
A common mistake is to confuse the number of fatal crashes with the number of
fatalities.
```{r}
crashes %>% filter(severity == "fatal") %>% nrow
sum(crashes$fatalities)
```
## Dates and times
Three columns of the `crashes` dataset describe the date and time of the crash
in the NZST time zone (Pacific/Auckland).
* `date` gives the date without the time
* `time` gives the time where this is available, and NA otherwise. Times are
stored as date-times on the first of January, 1970.
* `datetime` gives the date and time in one value when both are available, and
NA otherwise. `date` is always available, however `time` is not.
When aggregating by some function of the date, e.g. by year, then always start
from the `date` column unless you also need the time. This ensures against
accidentally discounting crashes where a time is not recorded.
```{r, fig.show = "hold"}
crashes %>%
filter(is.na(time)) %>%
count(year = year(date)) %>%
ggplot(aes(year, n)) +
geom_line() +
ggtitle("Crashes missing\ntime-of-day information")
crashes %>%
filter(is.na(time)) %>%
count(year = year(date)) %>%
mutate(percent = n/sum(n)) %>%
ggplot(aes(year, percent)) +
geom_line() +
scale_y_continuous(labels = percent) +
ggtitle("Percent of crashes missing\ntime-of-day information")
```
## Location coordinates
`r percent(nrow(filter(crashes, !is.na(easting)))/nrow(crashes))` of
crashes have coordinates. These have been converted from the NZTM projection to
the WGS84 projection for convenience with packages like `ggmap`.
Because New Zealand is tall and skinny, you can easily spot the main population
centres with a simple histogram.
```{r}
crashes %>%
ggplot(aes(northing)) +
geom_histogram(binwidth = .1)
```
## Vehicles
There can be many vehicles in one crash, so vehicles are recorded in a separate
`vehicles` dataset that can be joined to `crashes` by the common `id` column.
```{r}
crashes %>%
inner_join(vehicles, by = "id") %>%
count(vehicle) %>%
arrange(desc(n))
```
## Objects struck
There can be many objects struck in one crash, so these are recorded in a separate
`objects_struck` dataset that can be joined to `crashes` by the common `id` column.
Q: What are more fatal, trees or lamp posts?
```{r}
crashes %>%
inner_join(objects_struck, by = "id") %>%
filter(object %in% c("Trees, shrubbery of a substantial nature"
, "Utility pole, includes lighting columns")
, severity != "non-injury") %>% # non-injury crashes are poorly recorded
count(object, severity) %>%
group_by(object) %>%
mutate(percent = n/sum(n)) %>%
select(-n) %>%
spread(severity, percent)
```
A: Trees (Don't worry, I know it's harder than that.)
## Causes
Causes can be joined either to the `crashes` dataset (by the common `id`
column), or to the `vehicles` dataset (by both of the commont `id` and
`vehicle_id`) columns.
The main cause groups are given in the `causes_category` column.
```{r}
crashes %>%
inner_join(causes, by = "id") %>%
group_by(cause_category, id) %>%
tally %>%
group_by(cause_category) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
mutate(cause_category = factor(cause_category, levels = cause_category)) %>%
ggplot(aes(cause_category, n)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))
```
That's odd -- where are speed, alcohol, and restraints? They're given in `cause_subcategory`.
```{r}
causes %>%
filter(cause_subcategory == "Too fast for conditions") %>%
count(cause) %>%
arrange(desc(n))
```
There's nothing there about speed limit violations, because it's impossible to tell what speed a
vehicle was going at when it crashed.
More worryingly, how is "Alcohol test below limit" a cause for a crash?
Hopefully they filter those out when making policy decisions.
```{r}
levels(causes$cause) <- # Wrap facet labels
str_wrap(levels(causes$cause), 13)
crashes %>%
inner_join(causes, by = "id") %>%
filter(cause_subcategory %in% c("Alcohol or drugs")) %>%
group_by(cause, id) %>%
tally %>%
group_by(cause) %>%
summarize(n = n()) %>% # This extra step deals with many causes per crash
arrange(desc(n)) %>%
mutate(cause= factor(cause, levels = cause)) %>%
ggplot(aes(cause, n)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))
rm(causes) # Because we messed up the factor levels
```
This time, join `causes` to both `vehicles` and `crashes` to assess the
drunken cyclist menace.
```{r}
crashes %>%
filter(severity == "fatal") %>%
select(id) %>%
inner_join(vehicles, by = "id") %>%
filter(vehicle == "Bicycle") %>%
inner_join(causes, by = c("id", "vehicle_id")) %>%
count(cause) %>%
arrange(desc(n))
```
I think we all know what "Wandering or wobbling" means.