README.Rmd

---
output:
  md_document:
    variant: markdown_github
---

```{r, message = FALSE}
library(nzcrash)
library(dplyr)
library(tidyr)
library(magrittr)
library(stringr)
library(ggplot2)
library(scales)
library(lubridate)
```

# nzcrash

This package redistributes [crash
statistics](http://www.nzta.govt.nz/resources/crash-analysis-system-data/)
already available from the New Zealand Transport Agency, but in a more
convenient form.

It's a large package (over 20 megabytes, compressed).

## Datasets

The `crashes` dataset describes most facts about a crash.  The datasets `causes`,
`vehicles`, and `objects_struck` describe facts that are in a many-to-one
relationship with crashes.  They can be joined to the `crashes` dataset by the
common `id` column.  The `causes` dataset can additionally be joined to the
`vehicles` dataset by the combination of the `id` and `vehicle_id` columns.
This is most useful when the resulting table is also joined to the `crashes`
dataset.

## Up-to-date-ness

The data was last scraped from the NZTA website on `r Sys.Date()`.  At
that time, the NZTA had published data up to `r max(crashes$date)`.

```{r}
dim(crashes)
dim(causes)
dim(vehicles)
dim(objects_struck)
```

## Accuracy

The [NZTA](http://www.transport.govt.nz/research/roadtoll/#5), doesn't agree with [itself](http://www.transport.govt.nz/research/roadtoll/annualroadtollhistoricalinformation/) about recent annual road tolls, and this dataset gives a third opinion.

```{r}
crashes %>% 
  filter(severity == "fatal") %>%
  group_by(year = year(date)) %>%
  summarize(fatalities = sum(fatalities))
```

## Severity

Crashes categorised as "fatal", "serious", "minor" or "non-injury", based on the
casualties.  If there are any fatalities, then the crash is a "fatal" crash,
otherwise if there are any 'severe' injuries, the crash is a "serious" crash.

The definition of a 'severe' injury is not clear.

Minor and non-injury crashes are likely to be under-recorded since they often do
not involve the police, who write most of the crash reports upon which these
datasets are based.

A common mistake is to confuse the number of fatal crashes with the number of
fatalities.

```{r}
crashes %>% filter(severity == "fatal") %>% nrow
sum(crashes$fatalities)
```

## Dates and times

Three columns of the `crashes` dataset describe the date and time of the crash
in the NZST time zone (Pacific/Auckland).

* `date` gives the date without the time
* `time` gives the time where this is available, and NA otherwise.  Times are
  stored as date-times on the first of January, 1970.
* `datetime` gives the date and time in one value when both are available, and
  NA otherwise.  `date` is always available, however `time` is not.

When aggregating by some function of the date, e.g. by year, then always start
from the `date` column unless you also need the time.  This ensures against
accidentally discounting crashes where a time is not recorded.

```{r, fig.show = "hold"}
crashes %>%
  filter(is.na(time)) %>%
  count(year = year(date)) %>%
  ggplot(aes(year, n)) +
  geom_line() +
  ggtitle("Crashes missing\ntime-of-day information")

crashes %>%
  filter(is.na(time)) %>%
  count(year = year(date)) %>%
  mutate(percent = n/sum(n)) %>%
  ggplot(aes(year, percent)) +
  geom_line() +
  scale_y_continuous(labels = percent) +
  ggtitle("Percent of crashes missing\ntime-of-day information")
```

## Location coordinates

`r percent(nrow(filter(crashes, !is.na(easting)))/nrow(crashes))` of
crashes have coordinates.  These have been converted from the NZTM projection to
the WGS84 projection for convenience with packages like `ggmap`.

Because New Zealand is tall and skinny, you can easily spot the main population
centres with a simple histogram.

```{r}
crashes %>%
  ggplot(aes(northing)) +
  geom_histogram(binwidth = .1)
```

## Vehicles

There can be many vehicles in one crash, so vehicles are recorded in a separate
`vehicles` dataset that can be joined to `crashes` by the common `id` column.

```{r}
crashes %>%
  inner_join(vehicles, by = "id") %>%
  count(vehicle) %>% 
  arrange(desc(n))
```

## Objects struck

There can be many objects struck in one crash, so these are recorded in a separate
`objects_struck` dataset that can be joined to `crashes` by the common `id` column.

Q: What are more fatal, trees or lamp posts?

```{r}

crashes %>%
  inner_join(objects_struck, by = "id") %>%
  filter(object %in% c("Trees, shrubbery of a substantial nature"
                               , "Utility pole, includes lighting columns")
  , severity != "non-injury") %>% # non-injury crashes are poorly recorded
  count(object, severity) %>% 
  group_by(object) %>%
  mutate(percent = n/sum(n)) %>%
  select(-n) %>%
  spread(severity, percent)
```

A: Trees (Don't worry, I know it's harder than that.)

## Causes

Causes can be joined either to the `crashes` dataset (by the common `id`
column), or to the `vehicles` dataset (by both of the commont `id` and
`vehicle_id`) columns.

The main cause groups are given in the `causes_category` column.

```{r}
crashes %>%
  inner_join(causes, by = "id") %>%
  group_by(cause_category, id) %>%
  tally %>%
  group_by(cause_category) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  mutate(cause_category = factor(cause_category, levels = cause_category)) %>%
  ggplot(aes(cause_category, n)) + 
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))
```

That's odd -- where are speed, alcohol, and restraints?  They're given in `cause_subcategory`.

```{r}
causes %>% 
  filter(cause_subcategory == "Too fast for conditions") %>%
  count(cause) %>% 
  arrange(desc(n))
```

There's nothing there about speed limit violations, because it's impossible to tell what speed a
vehicle was going at when it crashed.

More worryingly, how is "Alcohol test below limit" a cause for a crash?
Hopefully they filter those out when making policy decisions.

```{r}
levels(causes$cause) <-                # Wrap facet labels
  str_wrap(levels(causes$cause), 13)
crashes %>%
  inner_join(causes, by = "id") %>%
  filter(cause_subcategory %in% c("Alcohol or drugs")) %>%
  group_by(cause, id) %>%
  tally %>%
  group_by(cause) %>%
  summarize(n = n()) %>%               # This extra step deals with many causes per crash
  arrange(desc(n)) %>%
  mutate(cause= factor(cause, levels = cause)) %>%
  ggplot(aes(cause, n)) + 
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))
rm(causes)                             # Because we messed up the factor levels
```

This time, join `causes` to both `vehicles` and `crashes` to assess the
drunken cyclist menace.

```{r}

crashes %>%
  filter(severity == "fatal") %>%
  select(id) %>%
  inner_join(vehicles, by = "id") %>% 
  filter(vehicle == "Bicycle") %>%
  inner_join(causes, by = c("id", "vehicle_id")) %>% 
  count(cause) %>%
  arrange(desc(n))
```

I think we all know what "Wandering or wobbling" means.