-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathplaying-tidyverse.Rmd
174 lines (114 loc) · 9.79 KB
/
playing-tidyverse.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
---
title: "First steps with the tidyverse collection"
author: "Jean-Baka Domelevo Entfellner"
date: "Last updated on `r format(Sys.time(), '%d %B, %Y')`"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, prompt = TRUE, comment = "", message = FALSE)
```
# Introduction to R packages
In R, **packages** are software modules providing extra functions, operators and datasets. Some packages, like `base` and `stats`, are part of the core R distribution, which means you don't need to install them separately, and they are already loaded in your environment upon starting your R session. When using RStudio, in the bottom right pane, you can see the available packages by clicking the `Packages` tab. The ones that are currently loaded in your environment appear with a tick mark. For those, you don't need to prepend the package name followed by a double colon when you call a function it implements (e.g. you can write `str_sub()` instead of `stringr::str_sub()` if and only if the `stringr` package is already loaded in your environment).
In order to load a package, one calls `library()` with the package name as argument. Quoting is unnecessary.
Before using a package for the first time, you need to download and install it on your computer by calling `install.packages()` with the quoted name of the package as argument.
# Introduction to the tidyverse collection
[tidyverse](https://www.tidyverse.org/) is a collection of open-source packages for R that were primarily developed by Hadley Wickham. Let us list the most common and broadly describe their scope:
1. `tibble` to work with *tibbles* instead of data frames: tibbles are "smart" data containers that display in a smarter way, with neater type control, etc.
2. `readr` to import tabular data from different text files (tab-delimited, csv, fixed-width fields, etc) into tibbles
3. `readxl` to import data from Excel files (both .xls and .xlsx) into tibbles
4. `tidyr` to transform data with different degrees of "messiness" into **tidy data**. Tidy data is simple to understand, but many datasets you come across are not "tidy". In tidy data, each variable is in its own column, and each observation (a.k.a "case" or "individual") is in its own row. `tidyr` helps you achieve tidy data and reshape datasets by collapsing or splitting columns, creating combinations of cases, handling missing values, etc
5. `magrittr` introduces additional pipe operators like `%>%` and `%<>%`
5. `dplyr` is for me **the central package** of the tidyverse collection. It's more than a pair of data pliers: it is a whole toolbox to extract information from a dataset, to group, summarize, join, apply functions to several variables, etc.
6. `ggplot2` implements a whole "grammar of graphics" and enables the user to create beautiful graphs of various types (scatterplots, histograms, jitter plots, violin plots, scatterplots, bargraphs, smoothed curves, 2D-density plots, etc), tweaking the way they are displayed through simple commands while maintaining a sense of graphical harmony.
7. `stringr` enables you to manipulate strings efficiently and effortlessly.
8. `purrr` is more advanced and deals with functional programming : programming by manipulating functions as objects
9. `forcats` is a re-implementation of R's classical factors (containers for categorical data).
All these packages constitute a huge programming effort and a considerable contribution
to the R ecosystem. I name them a "R 2.0". In order to install all of them at once, one goes:
```{r install tidyverse, eval=FALSE}
install.packages("tidyverse")
```
The above can take a little bit of time. Then we go:
```{r load tv}
library(tidyverse) # you could also load only the packages you want to use among those listed above, issuing one or several calls to library()
```
This opens up a whole world of possibilities, that we are going to explore in the remaining sections of this tutorial.
# The pipe operator
```{r pipe}
sum(3,7,2) # sum() accepts an arbitrary number of arguments and yields their sum
# can be written:
3 %>% sum(7,2)
```
The pipe operator `%>%` passes what is present on his left hand side as the first argument of the function call on its right hand side.
```{r pipe 2}
vec <- c(3,5,NA,7)
vec %>% sum(na.rm = TRUE) # same as sum(vec, na.rm = TRUE)
```
# Reading from files with readxl
We will import into an R object the content of the sheet called "HHcharacteristic" in the file `livestock_data.xlsx`. That is its second sheet. We write a call to the `read_xlsx()` function from the `readxl` package:
```{r read from xlsx, error = TRUE}
read_xlsx("livestock_data.xlsx", sheet = 2) # fails because readxl is actually not loaded by default when one loads tidyverse, so we go:
library(readxl)
read_xlsx("livestock_data.xlsx", sheet = 2) -> households # works fine
```
# Viewing a data frame or tibble
When you want to view the structure of a dataset, you can call the fundamental `str()` function on it:
```{r str}
str(households)
```
You can see that the data from the excel file has been imported correctly into a tibble (first line of the `str()` report) with 527 rows (observations) and 16 columns (variables). Following that, you get a snapshot of each variable (introduced by the `$` character for S3 objects, or the `@`character for S4 objects): its type, its length (in a data frame or tibble, all columns have the same height, being the number of observations) and its first elements (the list that finishes with the three dots).
Displaying tibbles by simply calling their name is not cumbersome (contrary to old-school `data.frame`s for which long outputs are cropped by a not-so-handsome message "[ reached 'max' / getOption("max.print") -- omitted xxx rows ]" while all the columns, however many, are displayed):
```{r displaying tibbles}
households
```
You can see the display fits the width of your screen, only the first 10 rows are displayed, and the additional variables (columns are hidden and summarised at the bottom in grey characters).
Finally, in RStudio, you can also use the `View()` function (be careful with the case!) to display a live view of the table as an additional tab in the top left (source editor) pane of RStudio:
```{r View a dataset}
View(households)
```
This view is not directly editable, but it is live, meaning that in case you change the dataset, the changes will reflect in the view. Notice you could also get to the same view by clicking on the said dataset within the `Environment` tab of the top right pane in RStudio.
# Simple summaries the old and new way
There is a column for sex (I guess, males and females). It is encoded as numeric and not as a qualitative variables with two values, so the summary with summary() is rather awkward:
```{r awkward summary}
summary(households$sex)
```
Note that the equivalent to the above but using pipes amounts to:
1. piping the whole dataset into a function that extracts *as a vector* its `sex` variable, and then
2. piping that vector into the `summary()` function.
Which is done as below:
```{r rewriting summary operation}
households %>% `$`(sex) %>% summary() # we quote the dollar operator with backticks
#or
households %>% `[[`("sex") %>% summary() # a tibble is also a data frame, hence a list of variables (columns), and the double-bracket operator extracts as a vector its named element
```
So, the summary is not very informative. With `dplyr`, the `summarize` operation (you can also write `summarise`, R accepts British English as well as American English) is meant to calculate a summary (single value) from one or several variables. What we want here is:
1. to *group* the dataset into groups according to the value of the "sex" variable
2. to display the pairs (value, count of observations) for each of the different encodings for the "sex" variable.
```{r group and summarize}
households %>% group_by(sex) %>% summarize(count = n())
```
See the excellent ["Data transformation with dplyr" cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf). That explains the above, which is what I was used to doing until now, but the recent versions of dplyr brought something simpler:
```{r count simpler}
households %>% count(sex)
```
To get such a neat table was possible in "the old R", of course:
```{r count table v1}
summary(as.factor(households$sex)) # transforming into qualitative data so that the summary() yields counts instead of the usual descriptive stats summary with quartiles
```
Alternatively:
```{r count table v2}
table(households$sex)
```
But you notice that in both the last two cases (with "the old R"), the result appears as a named vector, which is less tidy to work with than a true tibble.
# Filtering, slicing, grouping data: using pipes and the power of `dplyr`
Let's see some of the many features offered by the `dplyr` package when it comes to data wrangling.
We can filter some observations, producing a sub-table (still a tibble) containing only the observations matching logical criteria that we state. For example, if we want to build a tibble with only the households with at least 10 years of farming experience:
```{r filtering}
households %>% filter(hh_experience >= 10) # 339 observations
```
(please note an advantage of using the pipe versus the equivalent syntax `filter(households, hh_experience >=10)`: RStudio autocompletes for you the variable name when you are midway typing `hh_experience` and you hit the `Tab` key)
You can also filter the dataset based on a series of requirements, all to be fulfilled simultaneously (logical AND). Just make them several arguments to your call to `filter`. For instance, to select the the households with at least 10 years of farming experience **and** where the main person is a civil servant:
```{r filtering more}
households %>% filter(hh_experience >= 10, primary_activity == "Civil servant")
```
Only 21 observations remaining.