---
title: "Cassandra Coste TidyVerse"
author: "Cassandra Coste"
date: 4/11/2021
output: html_document
---
```{r setup, include = FALSE, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
# Using nest, unnest, map, and tidy functions to model and compare nested data
### Nest and unnest - creating lists within dataframes and tidy data for modelling
This is an introduction to the nest and unnest functions found in the 'tidyr' package, which is included in the tidyverse.
When you nest a data frame, you create a list-column whose elements are data frames. Nesting is implicitly a summarizing operation, since you get one row for each group defined by the non-nested columns.
You create nested data frames with tidyr::nest(); for example, df %>% nest(data = c(x, y)) specifies the columns to be nested.
Used in conjunction with the 'purrr' and 'broom' packages, nesting lets you apply operations, such as fitting a model, to each dataframe in the list.
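As a minimal sketch of the round trip described above, the chunk below nests and then unnests a small made-up tibble. The object names (toy, toy_nested, toy_restored) and the values are hypothetical and are not part of the happiness dataset.

```{r}
library(tidyr)
library(tibble)

# Hypothetical toy data: two groups with three observations each
toy <- tibble(
  group = c("a", "a", "a", "b", "b", "b"),
  x = 1:6,
  y = c(2.1, 3.9, 6.2, 1.0, 2.2, 2.9)
)

# Nest x and y into a list-column called `data`: one row per group
toy_nested <- nest(toy, data = c(x, y))
toy_nested

# unnest() reverses the operation and restores the original rows
toy_restored <- unnest(toy_nested, data)
toy_restored
```

Each element of the data column is itself a small tibble, which is what makes it possible to fit one model per row later on.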
### Loading the libraries and the dataset
For this example I will be using tidyr and purrr, which are attached with the tidyverse, along with magrittr and broom, which need to be loaded separately. We will load these first.
```{r, warning=F}
library(tidyverse)
library(magrittr)
library(broom)
library(purrr)
```
Next we load the CSV data that will be used for our examples.
```{r load data}
df <- as.data.frame(read.delim("https://raw.githubusercontent.com/cassandra-coste/CUNY607/main/data/world-happiness-report.csv", header = TRUE, stringsAsFactors = FALSE, sep = ",", fileEncoding = "UTF-8-BOM"))
```
### Setting up the model and demonstrating the tidy model format
First we start with an example for a single country, in this case Afghanistan, to demonstrate what we will ultimately want to do for all countries in the dataset. We do this by filtering the dataset for the country name Afghanistan. Then we run a simple linear model with healthy life expectancy as the outcome variable and year as the predictor variable. Finally, we use the tidy function from the broom package to view the linear regression results in a tidy model format.
```{r country}
Afghanistan_by_year <- df %>% filter(Country.name == "Afghanistan")
Afghanistan_lm <- lm(Healthy.life.expectancy.at.birth ~ year , Afghanistan_by_year)
tidy(Afghanistan_lm)
```
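To make the shape of the tidy output concrete, here is a small sketch on the mtcars data that ships with R. The dataset and the names toy_lm and toy_tidy are stand-ins chosen for illustration and are not part of the original analysis.

```{r}
library(broom)

# mtcars is built into R; it stands in for the happiness data here
toy_lm <- lm(mpg ~ wt, data = mtcars)

# tidy() returns one row per model term with the columns
# term, estimate, std.error, statistic, and p.value
toy_tidy <- tidy(toy_lm)
toy_tidy

# The slope can then be pulled out like any other data frame value
toy_tidy$estimate[toy_tidy$term == "wt"]
```

Because the result is an ordinary data frame, it can be filtered, arranged, and combined across many models, which is the property the rest of this document relies on.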
### Creating nested dataframes using the nest function
To prepare to run our analysis on all countries, we can create nested dataframes. The code below nests all columns except the country name column into a column named data, so for each country there will be a dataframe containing the other 10 variables for that country.
First, we use the map function to count the NA values in each column. Our outcome variable of interest, healthy life expectancy, has 55 NA values, which we drop so that our model can run.
Then we nest all variables except the country name, leaving us with the aforementioned data column to run our linear model on.
```{r nest}
#identify na values that may be an issue
map(df, ~sum(is.na(.)))
#drop na values in outcome variable column
by_country <- df %>% drop_na(Healthy.life.expectancy.at.birth)
#nest the dataframe by country
by_country %<>%
  nest(data = !Country.name)
```
### Run models on nested dataframes using map function
Now that we have a nested dataframe for each country, we can use the map function from the purrr package to run the linear regression for each country.
In general, map applies an operation to each item in a list.
For example, given the list a <- list(1, 2, 3, 4), calling map(a, ~ . * 2) multiplies each item in the list "a" by 2 and returns a list containing 2, 4, 6, and 8, as seen below.
```{r}
a <- list(1, 2, 3, 4)
map(a, ~ . * 2)
```
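The call above returns a list. As a brief sketch, purrr also provides typed variants such as map_dbl and map_int that return plain vectors, which are often easier to read. The names nums, doubled_list, doubled_vec, and na_counts are illustrative, and the built-in airquality dataset is used only as a stand-in for the NA count shown earlier.

```{r}
library(purrr)

nums <- list(1, 2, 3, 4)

# map() always returns a list
doubled_list <- map(nums, ~ .x * 2)

# map_dbl() returns a plain numeric vector instead
doubled_vec <- map_dbl(nums, ~ .x * 2)
doubled_vec

# The same idea counts missing values per column of a data frame
na_counts <- map_int(airquality, ~ sum(is.na(.x)))
na_counts
```

The typed variants raise an error if a result does not match the requested type, so they also act as a light check on what each step returns.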
Returning to the life expectancy example, we can use the map function to run a simple linear regression for each country and store the fitted models in a new column named model.
```{r}
# Use map to run the linear regression model for each country in the dataframe using the nested dataframes
by_country_model <- by_country %>% mutate(model = map(data, ~ lm(Healthy.life.expectancy.at.birth ~ year, data = .x)))
```
### Tidy models using tidy and unnest functions
To take this one step further, we can tidy our model column, which contains a list of fitted models. We use map again with the tidy function to turn each model into a dataframe stored in a new column called tidied, and finally use unnest on the tidied column so that we can easily see the coefficients of the model run for each country.
```{r}
# Here we run the same models as earlier but tidy and unnest the results
by_country_model <- by_country %>%
  mutate(
    model = map(data, ~ lm(Healthy.life.expectancy.at.birth ~ year, data = .)),
    tidied = map(model, tidy)
  ) %>%
  unnest(tidied)
# View our tidied model results
head(by_country_model)
```
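Coefficients are only part of the comparison. As a hedged sketch, broom's glance function returns one row of fit statistics per model, so the same nest, map, and unnest pattern can show how well the linear trend fits each group. The chunk uses mtcars nested by cyl as a stand-in for the happiness data nested by country, and fit_summary is a hypothetical name.

```{r}
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# mtcars (built into R) nested by cylinder count stands in for
# the happiness data nested by country
fit_summary <- mtcars %>%
  nest(data = !cyl) %>%
  mutate(
    model = map(data, ~ lm(mpg ~ wt, data = .x)),
    glanced = map(model, glance)
  ) %>%
  unnest(glanced) %>%
  select(cyl, r.squared, sigma, nobs)

fit_summary
```

Applied to by_country, the same steps would give one R-squared value per country, which helps flag countries where a straight line describes the trend poorly.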
## Tidyverse extend
I really liked what Cassandra did with this data. She organized it in a way that preps it for some great analysis. I can leverage her work to further evaluate the data using the filter, arrange, and slice functions, followed by a ggplot2 bar chart, for some intriguing data presentations.
```{r}
# perhaps this is arbitrary, but I'm going to focus on the "year" rows, so I'll start by filtering out intercepts
by_country_year <- by_country_model %>%
  filter(term == "year")
# round every numeric column to two decimal places
by_country_year <- by_country_year %>% mutate_if(is.numeric, ~ round(., 2))
# now let's arrange by the estimate column, largest slope first
# (refer to the column by name inside arrange rather than with $)
df_country <- by_country_year %>% arrange(desc(estimate))
# subset our data to focus only on the top 20 of these countries as arranged above
df_country_20 <- df_country %>% slice(1:20)
```
Now let's visualize our subset.
```{r pressure, echo=FALSE}
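# reorder() sorts the bars by estimate; without it ggplot2 orders countries alphabetically
g <- ggplot(df_country_20, aes(x = reorder(Country.name, -estimate), y = estimate)) +
  geom_col(fill = "#0099f9") +
  geom_text(aes(label = estimate), vjust = -0.5, size = 2) +
  labs(x = "Country", y = "Estimated yearly change in healthy life expectancy")
g + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))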
```