-
Notifications
You must be signed in to change notification settings - Fork 7
/
Copy path05-manipulating_data-web.Rmd
710 lines (430 loc) · 31.7 KB
/
05-manipulating_data-web.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
# Manipulating data {#manipulating}
<!-- Please don't mess with the next few lines! -->
<style>h5{font-size:2em;color:#0000FF}h6{font-size:1.5em;color:#0000FF}div.answer{margin-left:5%;border:1px solid #0000FF;border-left-width:10px;padding:25px} div.summary{background-color:rgba(30,144,255,0.1);border:3px double #0000FF;padding:25px}</style>`r options(scipen=999)`<p style="color:#ffffff">`r intToUtf8(c(50,46,48))`</p>
<!-- Please don't mess with the previous few lines! -->
::: {.summary}
### Functions introduced in this chapter {-}
`read_csv`, `select`, `rename`, `rm`, `filter`, `slice`, `arrange`, `mutate`, `all.equal`, `ifelse`, `transmute`, `summarise`, `group_by`, `%>%`, `count`
:::
## Introduction {#manipulating-intro}
This tutorial will import some data from the web and then explore it using the amazing `dplyr` package, a package which is quickly becoming the *de facto* standard among R users for manipulating data. It's part of the `tidyverse` that we've already used in several chapters.
### Install new packages {#manipulating-install}
There are no new packages used in this chapter.
### Download the R notebook file {#manipulating-download}
Check the upper-right corner in RStudio to make sure you're in your `intro_stats` project. Then click on the following link to download this chapter as an R notebook file (`.Rmd`).
<a href = "https://vectorposse.github.io/intro_stats/chapter_downloads/05-manipulating_data.Rmd" download>https://vectorposse.github.io/intro_stats/chapter_downloads/05-manipulating_data.Rmd</a>
Once the file is downloaded, move it to your project folder in RStudio and open it there.
### Restart R and run all chunks {#manipulating-restart}
In RStudio, select "Restart R and Run All Chunks" from the "Run" menu.
### Load packages {#manipulating-load}
We load the `tidyverse` package as usual, but this time it is to give us access to the `dplyr` package, which is loaded alongside our other `tidyverse` packages like `ggplot2`. The `tidyverse` also has a package called `readr` that will allow us to import data from an external source (in this case, a web site).
```{r}
library(tidyverse)
```
## Importing CSV data {#manipulating-csv}
For most of the chapters, we use data sets that are either included in base R or included in a package that can be loaded into R. But it is useful to see how to get a data set from outside the R ecosystem. This depends a lot on the format of the data file, but a common format is a "comma-separated values" file, or CSV file. If you have a data set that is not formatted as a CSV file, it is usually pretty easy to open it in something like Google Spreadsheets or Microsoft Excel and then re-save it as a CSV file.
The file we'll import is a random sample from all the commercial domestic flights that departed from Houston, Texas, in 2011.
We use the `read_csv` command to import a CSV file. In this case, we're grabbing the file from a web page where the file is hosted. If you have a file on your computer, you can also put the file into your project directory and import it from there. Put the URL (for a web page) or the filename (for a file in your project directory) in quotes inside the `read_csv`command. We also need to assign the output to a tibble, so we've called it `hf` for "Houston flights".
```{r}
hf <- read_csv("https://vectorposse.github.io/intro_stats/data/hf.csv")
```
```{r}
hf
```
```{r}
glimpse(hf)
```
The one disadvantage of a file imported from the internet or your computer is that it does not come with a help file. (Only packages in R have help files.) Hopefully you have access to some kind of information about the data you're importing. In this case, we get lucky because the full Houston flights data set happens to be available in a package called `hflights`.
##### Exercise 1 {-}
Go to the help tab in RStudio and search for `hflights`. Of the several options that appear, click the one from the `hflights` package (listed as `hflights::hflights`). Review the help file so you know what all the variables mean. Report below how many cases are in the original `hflights` data. What fraction of the original data has been sampled in the CSV file we imported above?
::: {.answer}
Please write up your answer here.
:::
## Introduction to `dplyr` {#manipulating-dplyr}
The `dplyr` package (pronounced "dee-ply-er") contains tools for manipulating the rows and columns of tibbles. The key to using `dplyr` is to familiarize yourself with the "key verbs":
- `select` (and `rename`)
- `filter` (and `slice`)
- `arrange`
- `mutate` (and `transmute`)
- `summarise` (with `group_by`)
We'll consider these one by one. We won't have time to cover every aspect of these functions. More information appears in the help files, as well as this very helpful "cheat sheet": [https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf)
## `select` {#manipulating-select}
The `select` verb is very easy. It just selects some subset of variables (the columns of your data set).
The `select` command from the `dplyr` package illustrates one of the common issues R users face. Because the word "select" is pretty common, and selecting things is a common task, there are multiple packages that have a function called `select`. Depending on the order in which packages were loaded, R might get confused as to which version of `select` you want and try to apply the wrong one. One way to get the correct version is to specify the package in the syntax. Instead of typing `select`, we can type `dplyr::select` to ensure we get the version from the `dplyr` package. We'll do this in all future uses of the `select` function. (The other functions in this chapter don't cause us trouble because we don't use any other packages whose functions conflict like this.)
Suppose all we wanted to see was the carrier, origin, and destination. We would type
```{r}
hf_select <- dplyr::select(hf, UniqueCarrier, Origin, Dest)
hf_select
```
A brief but important aside here: there is nothing special about the variable name `hf_select`. I could have typed
`beef_gravy <- dplyr::select(hf, UniqueCarrier, Origin, Dest)`
and it would work just as well. Generally speaking, though, you want to give variables a name that reflects the intent of your analysis.
Also, **it is important to assign the result to a new variable**. If I had typed
`hf <- dplyr::select(hf, UniqueCarrier, Origin, Dest)`
this would have overwritten the original tibble `hf` with this new version with only three variables. I want to preserve `hf` because I want to do other things with the entire data set later. The take-home message here is this: **Major modifications to your data should generally be given a new variable name.** There are caveats here, though. Every time you create a new variable, you also fill up more memory with your creation. If you check your Global Environment, you'll see that both `hf` and `hf_select` are sitting in there. We'll have more to say about this in a moment.
Okay, back to the `select` function. The first argument of `select` is the tibble. After that, just list all the names of the variables you want to select.
If you don't like the names of the variables, you can change them as part of the select process.
```{r}
hf_select <- dplyr::select(hf,
carrier = UniqueCarrier,
origin = Origin,
dest = Dest)
hf_select
```
(Note here that I am overwriting `hf_select` which had been defined slightly differently before. However, these two versions of `hf_select` are basically the same object, so no need to keep two copies here.)
There are a few notational shortcuts. For example, see what the following do.
```{r}
hf_select2 <- dplyr::select(hf, DayOfWeek:UniqueCarrier)
hf_select2
```
```{r}
hf_select3 <- dplyr::select(hf, starts_with("Taxi"))
hf_select3
```
##### Exercise 2 {-}
What is contained in the new tibbles `hf_select2` and `hf_select3`? In other words, what does the colon (:) appear to do and what does `starts_with` appear to do in the `select` function?
::: {.answer}
Please write up your answer here.
:::
*****
The cheat sheet shows a lot more of these "helper functions" if you're interested.
The other command that's related to `select` is `rename`. The only difference is that `select` will throw away any columns you don't select (which is what you want and expect, typically), whereas `rename` will keep all the columns, but rename those you designate.
##### Exercise 3 {-}
Putting a minus sign in front of a variable name in the `select` command will remove the variable. Create a tibble called `hf_select4` that removes `Year`, `DayofMonth`, `DayOfWeek`, `FlightNum`, and `Diverted`. (Be careful with the unusual---and inconsistent!---capitalization in those variable names.) In the second part of the code chunk below, type `hf_select4` so that the tibble prints to the screen (just like in all the above examples).
::: {.answer}
```{r}
# Add code here to define hf_select4.
# Add code here to print hf_select4.
```
:::
## The `rm` command {#manipulating-rm}
Recall that earlier we mentioned the pros and cons of creating a new tibble every time we make a change. On one hand, making a new tibble instead of overwriting the original one will keep the original one available so that we can run different commands on it. On the other hand, making a new tibble does eat up a lot of memory.
One way to get rid of an object once we are done with it is the `rm` command, where `rm` is short for "remove". When you run the code chunk below, you'll see that all the tibbles we created with `select` will disappear from your Global Environment.
```{r}
rm(hf_select, hf_select2, hf_select3)
```
If you need one these tibbles back later, you can always go back and re-run the code chunk that defined it.
We'll use `rm` at the end of some of the following sections so that we don't use up too much memory.
##### Exercise 4 {-}
Remove `hf_select4` (that you created in Exercise 3) from the Global Environment.
::: {.answer}
```{r}
# Add code here to remove hf_select4.
```
:::
## `filter` {#manipulating-filter}
The `filter` verb works a lot like `select`, but for rows instead of columns.
For example, let's say we only want to see Delta flights. We use `filter`:
```{r}
hf_filter <- filter(hf, UniqueCarrier == "DL")
hf_filter
```
In the printout of the tibble above, if you can't see the `UniqueCarrier` column, click the black arrow on the right to scroll through the columns until you can see it. You can click "Next" at the bottom to scroll through the rows.
##### Exercise 5 {-}
How many rows did we get in the `hf_filter` tibble? What do you notice about the `UniqueCarrier` of all those rows?
::: {.answer}
Please write up your answer here.
:::
*****
Just like `select`, the first argument of `filter` is the name of the tibble. Following that, you must specify some condition. Only rows meeting that condition will be included in the output.
One thing that is unusual here is the double equal sign (`UniqueCarrier == "DL"`). This won't be a mystery to people with programming experience, but it tends to be a sticking point for the rest of us. A single equals sign represents assignment. If I type `x = 3`, what I mean is, "Take the letter x and assign it the value 3." In R, we would also write `x <- 3` to mean the same thing. The first line of the code chunk below assigns `x` to be 3. Therefore, the following line that just says `x` creates the output "3".
```{r}
x = 3
x
```
On the other hand, `x == 3` means something completely different. This is a logical statement that is either true or false. Either `x` is 3, in which case we get `TRUE` or `x` is not 3, and we get `FALSE`.
```{r}
x == 3
```
(It's true because we just assigned `x` to be 3 in the previous code chunk!)
In the above `filter` command, we are saying, "Give me the rows where the value of `UniqueCarrier` is `"DL"`, or, in other words, where the statement `UniqueCarrier == "DL"` is true.
As another example, suppose we wanted to find out all flights that leave before 6:00 a.m.
```{r}
hf_filter2 <- filter(hf, DepTime < 600)
hf_filter2
```
##### Exercise 6 {-}
Look at the help file for `hflights` again. Why do we have to use the number 600 in the command above? (Read the description of the `DepTime` variable.)
::: {.answer}
Please write up your answer here.
:::
*****
If we need two or more conditions, we use `&` for "and" and `|` for "or". The following will give us only the Delta flights that departed before 6:00 a.m.
```{r}
hf_filter3 <- filter(hf, UniqueCarrier == "DL" & DepTime < 600)
hf_filter3
```
Again, check the cheat sheet for more complicated condition-checking if needed.
##### Exercise 7(a) {-}
The symbol `!=` means "not equal to" in R. Use the `filter` command to create a tibble called `hf_filter4` that finds all flights *except* those flying into Salt Lake City ("SLC"). As before, print the output to the screen.
::: {.answer}
```{r}
# Add code here to define hf_filter4.
# Add code here to print hf_filter4.
```
:::
##### Exercise 7(b) {-}
Based on the output of the previous part, how many flights were there flying into SLC? (In other words, how many rows were removed from the original `hf` tibble to produce `hf_filter4`?)
::: {.answer}
Please write up your answer here.
:::
##### Exercise 8 {-}
Use the `rm` command to remove all the extra tibbles you created in this section with `filter`.
::: {.answer}
```{r}
# Add code here to remove all filtered tibbles.
```
:::
*****
The `slice` command is related, but fairly useless in practice. It will allow you to extract rows by position. So `slice(hf, 1:10)` will give you the first 10 rows. As a general rule, the information available in a tibble should never depend on the order in which the rows appear. Therefore, no function you run should make any assumptions about the ordering of your data. The only reason one might want to think about the order of data is for convenience in presenting that data visually for someone to inspect. And that brings us to...
## `arrange` {#manipulating-arrange}
This just re-orders the rows, sorting on the values of one or more specified columns. As I mentioned before, in most data analyses you work with summaries of the data that do not depend on the order of the rows, so this is not quite as interesting as some of the other verbs. In fact, since the re-ordering is usually for the visual benefit of the reader, there is often no need to store the output in a new variable. We'll just print the output to the screen.
```{r}
arrange(hf, ActualElapsedTime)
```
Scroll over to the `ActualElapsedTime` variable in the output above (using the black right arrow) to see that these are now sorted in ascending order.
##### Exercise 9 {-}
How long is the shortest actual elapsed time? Why is this flight so short? (Hint: look at the destination.) Which airline flies that route? You may have to use your best friend Google to look up airport and airline codes.
::: {.answer}
Please write up your answer here.
:::
*****
If you want descending order, do this:
```{r}
arrange(hf, desc(ActualElapsedTime))
```
##### Exercise 10 {-}
How long is the longest actual elapsed time? Why is this flight so long? Which airline flies that route? Again, you may have to use your best friend Google to look up airport and airline codes.
::: {.answer}
Please write up your answer here.
:::
##### Exercise 11(a) {-}
You can sort by multiple columns. The first column listed will be the first in the sort order, and then within each level of that first variable, the next column will be sorted, etc. Print a tibble that sorts first by destination (`Dest`) and then by arrival time (`ArrTime`), both in the default ascending order.
::: {.answer}
```{r}
# Add code here to sort hf first by Dest and then by ArrTime.
```
:::
##### Exercise 11(b) {-}
Based on the output of the previous part, what is the first airport code alphabetically and to what city does it correspond? (Use Google if you need to link the airport code to a city name.) At what time did the earliest flight to that city arrive?
::: {.answer}
Please write up your answer here.
:::
## `mutate` {#manipulating-mutate}
Frequently, we want to create new variables that combine information from one or more existing variables. We use `mutate` for this. For example, suppose we wanted to find the total time of the flight. We might do this by adding up the minutes from several variables: `TaxiOut`, `AirTime`, and `TaxiIn`, and assigning that sum to a new variable called `total`. Scroll all the way to the right in the output below (using the black right arrow) to see the new `total` variable.
```{r}
hf_mutate <- mutate(hf, total = TaxiOut + AirTime + TaxiIn)
hf_mutate
```
As it turns out, that was wasted effort because that variable already exists in `ActualElapsedTime`. The `all.equal` command below checks that both specified columns contain the exact same values.
```{r}
all.equal(hf_mutate$total, hf$ActualElapsedTime)
```
Perhaps we want a variable that just classifies a flight as arriving late or not. Scroll all the way to the right in the output below to see the new `late` variable.
```{r}
hf_mutate2 <- mutate(hf, late = (ArrDelay > 0))
hf_mutate2
```
This one is a little tricky. Keep in mind that `ArrDelay > 0` is a logical condition that is either true or false, so that truth value is what is recorded in the `late` variable. If the arrival delay is a positive number of minutes, the flight is considered "late", and if the arrival delay is zero or negative, it's not late.
If we would rather see more descriptive words than `TRUE` or `FALSE`, we have to do something even more tricky. Look at the `late` variable in the output below.
```{r}
hf_mutate3 <- mutate(hf,
late = as_factor(ifelse(ArrDelay > 0,
"Late", "On time")))
hf_mutate3
```
The `as_factor` command tells R that `late` should be a categorical variable. Without it, the variable would be a "character" variable, meaning a list of character strings. It won't matter for us here, but in any future analysis, you want categorical data to be treated as such by R.
The main focus here is on the `ifelse` construction. The `ifelse` function takes a condition as its first argument. If the condition is true, it returns the value in the second slot, and if it's false (the "else" part of if/else), it returns the value in the third slot. In other words, if `ArrDelay > 0`, this means the flight is late, so the new `late` variable should say "Late"; whereas, if `ArrDelay` is not greater than zero (so either zero or possibly negative if the flight arrived early), then the new variable should say "On Time".
Having said that, I would generally recommend that you leave these kinds of variables as logical types. It's much easier to summarize such variables in R, namely because R treats `TRUE` as 1 and `FALSE` as 0, allowing us to do things like this:
```{r}
mean(hf_mutate2$late, na.rm = TRUE)
```
This gives us the proportion of late flights.
Note that we needed `na.rm` as you've seen in previous chapter. For example, look at the 93rd row of the tibble:
```{r}
slice(hf_mutate2, 93)
```
Notice that all the times are missing. There are a bunch of rows like this. Since there is not always an arrival delay listed, the `ArrDelay` variable doesn't always have a value, and if `ArrDelay` is `NA`, the `late` variable will be too. So if we try to calculate the mean with just the `mean` command, this happens:
```{r}
mean(hf_mutate2$late)
```
##### Exercise 12 {-}
Why does taking the mean of a bunch of zeros and ones give us the proportion of ones? (Think about the formula for the mean. What happens when we take the sum of all the zeros and ones, and what happens when we divide by the total?)
::: {.answer}
Please write up your answer here.
:::
##### Exercise 13 {-}
Create a new tibble called `hf_mutate4` that uses the `mutate` command to create a new variable called `dist_k` which measures the flight distance in kilometers instead of miles. (Hint: to get from miles to kilometers, multiply the distance by 1.60934.) Print the output to the screen.
::: {.answer}
```{r}
# Add code here to define hf_mutate4.
# Add code here to print hf_mutate4.
```
:::
*****
A related verb is `transmute`. The only difference between `mutate` and `transmute` is that `mutate` creates the new column(s) and keeps all the old ones too, whereas `transmute` will throw away all the columns except the newly created ones. This is not something that you generally want to do, but there are exceptions. For example, if I was preparing a report and I needed only to talk about flights being late or not, it would do no harm (and would save some memory) to throw away everything except the `late` variable.
Before moving on to the next section, we'll clean up the extra tibbles lying around. You'll need to manually click the run button in the next code chunk since you have defined `hf_mutate4`.
```{r}
rm(hf_mutate, hf_mutate2, hf_mutate3, hf_mutate4)
```
## `summarise` (with `group_by`) {#manipulating-summ-group}
First, before you mention that `summarise` is spelled wrong...well, the author of the `dplyr` package is named Hadley Wickham (same author as the `ggplot2` package) and he is from New Zealand. So that's the way he spells it. He was nice enough to include the `summarize` function as an alias if you need to use it 'cause this is 'Murica!
The `summarise` function, by itself, is kind of boring, and doesn't do anything that couldn't be done more easily with base R functions.
```{r}
summarise(hf, mean(Distance))
```
```{r}
mean(hf$Distance)
```
Where `summarise` shines is in combination with `group_by`. For example, let's suppose that we want to see average flight distances, but broken down by airline. We can do the following:
```{r}
hf_summ_grouped <- group_by(hf, UniqueCarrier)
hf_summ <- summarise(hf_summ_grouped, mean(Distance))
hf_summ
```
### Piping {#manipulating-piping}
This is a good spot to introduce a time-saving and helpful device called "piping", denoted by the symbol `%>%`. We've seen this weird combination of symbols in past chapters, but we haven't really explained what they do.
Piping always looks more complicated than it really is. The technical definition is that
`x %>% f(y)`
is equivalent to
`f(x, y)`.
As a simple example, we could add two numbers like this:
```{r}
sum(2, 3)
```
Or using the pipe, we could do it like this:
```{r}
2 %>% sum(3)
```
All this is really saying is that the pipe takes the thing on its left, and plugs it into the first slot of the function on its right. So why do we care?
Let's revisit the combination `group_by`/`summarise` example above. There are two ways to do this without pipes, and both are a little ugly. One way is above, where you have to keep reassigning the output to new variables (in the case above, to `hf_summ_grouped` and then `hf_summ`). The other way is to nest the functions:
```{r}
summarise(group_by(hf, UniqueCarrier), mean(Distance))
```
This requires a lot of brain power to parse. In part, this is because the function is inside-out: first you group `hf` by `UniqueCarrier`, and then the result of that is summarized. Here's how the pipe fixes it:
```{r}
hf %>%
group_by(UniqueCarrier) %>%
summarise(mean(Distance))
```
Look at the `group_by` line. The `group_by` function should take two arguments, the tibble, and then the grouping variable. It appears to have only one argument. But look at the previous line. The pipe says to insert whatever is on its left (`hf`) into the first slot of the function on its right (`group_by`). So the net effect is still to evaluate the function `group_by(hf, UniqueCarrier)`.
Now look at the `summarise` line. Again, `summarise` is a function of two inputs, but all we see is the part that finds the mean. The pipe at the end of the previous line tells the `summarise` function to insert the stuff already computed (the grouped tibble returned by `group_by(hf, UniqueCarrier)`) into the first slot of the `summarise` function.
Piping takes a little getting used to, but once you're good at it, you'll never go back. It's just makes more sense semantically. When I read the above set of commands, I see a set of instructions in chronological order:
- Start with the tibble `hf`.
- Next, group by the carrier.
- Next, summarize each group using the mean distance.
Now we can assign the result of all that to the new variable `hf_summ`:
```{r}
hf_summ <- hf %>%
group_by(UniqueCarrier) %>%
summarise(mean(Distance))
hf_summ
```
Some people even take this one step further. The result of all the above is assigned to a new variable `hf_summ` that currently appears as the first command (`hf_summ <- ...`) But you could write this as
```{r}
hf %>%
group_by(UniqueCarrier) %>%
summarise(mean(Distance)) -> hf_summ
```
Now it says the following:
- Start with the tibble `hf`.
- Next, group by the carrier.
- Next, summarize each group using the mean distance.
- *Finally*, assign the result to a new variable called `hf_summ`.
In other words, the arrow operator for assignment works both directions!
Let's try some counting. This one is common enough that `dplyr` doesn't even make us use `group_by` and `summarise`. We can just use the command `count`. What if we wanted to know how many flights correspond to each carrier?
```{r}
hf_summ2 <- hf %>%
count(UniqueCarrier)
hf_summ2
```
Also note that we can give summary columns a new name if we wish. In `hf_summ`, we didn't give the new column an explicit name, so it showed up in our tibble as a column called `mean(Distance)`. If we want to change it, we can do this:
```{r}
hf_summ <- hf %>%
group_by(UniqueCarrier) %>%
summarise(mean_dist = mean(Distance))
hf_summ
```
Look at the earlier version of `hf_summ` and compare it to the one above. Make sure you see that the name of the second column changed.
The new count column of `hf_summ2` is just called `n`. That's okay, but if we insist on giving it a more user-friendly name, we can do so as follows:
```{r}
hf_summ2 <- hf %>%
count(UniqueCarrier, name = "total_count")
hf_summ2
```
This is a little different because it requires us to use a `name` argument and put the new name in quotes.
##### Exercise 14(a) {-}
Create a tibble called `hf_summ3` that lists the total count of flights for each day of the week. Be sure to use the pipe as above. Print the output to the screen. (You don't need to give the count column a new name.)
::: {.answer}
```{r}
# Add code here to define hf_summ3.
# Add code here to print hf_summ3.
```
:::
##### Exercise 14(b) {-}
According to the output in the previous part, what day of the week had the fewest flights? (Assume 1 = Monday.)
::: {.answer}
Please write up your answer here.
:::
*****
The tibbles created in this section are all just a few rows each. They don't take up much memory, so we don't really need to remove them. You can if you want, but it's not necessary.
## Putting it all together {#manipulating-all-together}
Often we need more than one of these verbs. In many data analyses, we need to do a sequence of operations to get at the answer we seek. This is most easily accomplished using a more complicated sequence of pipes.
Here's a example of multi-step piping. Let's say that we only care about Delta flights, and even then, we only want to know about the month of the flight and the departure delay. From there, we wish to group by month so we can find the maximum departure delay by month. Here is a solution, piping hot and ready to go. [groan]
```{r}
hf_grand_finale <- hf %>%
filter(UniqueCarrier == "DL") %>%
dplyr::select(Month, DepDelay) %>%
group_by(Month) %>%
summarise(max_delay = max(DepDelay, na.rm = TRUE))
hf_grand_finale
```
Go through each line of code carefully and translate it into English:
- We define a variable called `hf_grand_finale` that starts with the original `hf` data.
- We `filter` this data so that only Delta flights will be analyzed.
- We `select` the variables `Month` and `DepDelay`, throwing away all other variables that are not of interest to us. (Don't forget to use the `dplyr::select` syntax to make sure we get the right function!)
- We `group_by` month so that the results will be displayed by month.
- We `summarise` each month by listing the maximum value of `DepDelay` that appears within each month.
- We print the result to the screen.
Notice in the `summarise` line, we again took advantage of `dplyr`'s ability to rename any variable along the way, assigning our computation to the new variable `max_delay`. Also note the need for `na.rm = TRUE` so that the `max` command ignores any missing values.
A minor simplification results from the realization that `summarise` must throw away any variables it doesn't need. (Think about why for a second: what would `summarise` do with, say, `ArrTime` if we've only asked it to calculate the maximum value of `DepDelay` for each month?) So you could write this instead, removing the `select` clause:
```{r}
hf_grand_finale <- hf %>%
filter(UniqueCarrier == "DL") %>%
group_by(Month) %>%
summarise(max_delay = max(DepDelay, na.rm = TRUE))
hf_grand_finale
```
Check that you get the same result. With *massive* data sets, it's possible that the selection and sequence of these verbs matter, but you don't see an appreciable difference here, even with `r NROW(hf)` rows. (There are ways of benchmarking performance in R, but that is a more advanced topic.)
##### Exercise 15 {-}
Summarize in your own words what information is contained in the `hf_grand_finale` tibble. In other words, what are the numbers in the `max_delay` column telling us? Be specific!
::: {.answer}
Please write up your answer here.
:::
The remaining exercises are probably the most challenging you've seen so far in the course. Take each slowly. Read the instructions carefully. Go back through the chapter and identify which "verb" needs to be used for each part of the task. Examine the sample code in those sections carefully to make sure you get the syntax right. Create a sequence of "pipes" to do each task, one-by-one. Copy and paste pieces of code from earlier and make minor changes to adapt the code to the given task.
##### Exercise 16 {-}
Create a tibble that counts the flights to LAX grouped by day of the week. (Hint: you need to `filter` to get flights to LAX. Then you'll need to `count` using `DayOfWeek` as a grouping variable. Because you're using `count`, you don't need `group_by` or `summarise`.) Print the output to the screen.
::: {.answer}
```{r}
# Add code here to count the flights to LAX
# grouped by day of the week.
# Print the output to the screen.
```
:::
##### Exercise 17 {-}
Create a tibble that finds the median distance flight for each airline. Sort the resulting tibble from highest distance to lowest. (Hint: You'll need to `group_by` carrier and `summarise` using the `median` function. Finally, you'll need to `arrange` the result according to the median distance variable that you just created.) Print the output to the screen.
::: {.answer}
```{r}
# Add code here to find the median distance by airline.
# Print the output to the screen.
```
:::
## Conclusion {#manipulating-conclusion}
Raw data often doesn't come in the right form for us to run our analyses. The `dplyr` verbs are powerful tools for manipulating tibbles until they are in the right form.
### Preparing and submitting your assignment {#manipulating-prep}
1. From the "Run" menu, select "Restart R and Run All Chunks".
2. Deal with any code errors that crop up. Repeat steps 1–-2 until there are no more code errors.
3. Spell check your document by clicking the icon with "ABC" and a check mark.
4. Hit the "Preview" button one last time to generate the final draft of the `.nb.html` file.
5. Proofread the HTML file carefully. If there are errors, go back and fix them, then repeat steps 1--5 again.
If you have completed this chapter as part of a statistics course, follow the directions you receive from your professor to submit your assignment.