---
title: "EEEB UN3005/GR5005 \nLab - Week 02 - 03 and 05 February 2020"
author: "USE YOUR NAME HERE"
output: pdf_document
fontsize: 12pt
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
```
# Data Cleaning
To practice data cleaning in this week's lab, we'll use a subset of [published data](https://www.nature.com/articles/sdata201817) on RNA viruses collated by Mark Woolhouse and Liam Brierley. The full dataset contains trait information gathered from the scientific literature on 214 RNA viruses that are known to infect humans. See the ["Data Records"](https://www.nature.com/articles/sdata201817#data-records) section of the published paper for information on the variables included in the full dataset. I've downloaded the data, converted it to a CSV file, and pulled out a subset to make it easier to work with. This subset contains information on 93 RNA viruses. You can find it on the class CourseWorks page as `Woolhouse_and_Brierley_RNA_virus_database_reduced.csv`.
## Exercise 1: Data Import
Download the Woolhouse and Brierley data and import it into R, assigning it to an object named `viruses`. Run `summary()` on this object. You'll get a lot of information in return; the goal here is just to familiarize yourself broadly with the dataset.
```{r}
```
## Exercise 2: Code Translation
For this series of exercises, you'll be given a chunk of code that does some data manipulation in base R. Your goal is to describe what this code is doing (in text below the code) and then translate that data manipulation operation using `dplyr` functions (in the empty code chunks). The `dplyr` solution will hopefully be simpler and more intuitive (which is why I'm encouraging you to learn `dplyr`). However, as an R user, you'll also see lots of code written with base R functions, so it's best to understand the basics of data manipulation with these built-in functions as well.
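As a warm-up, here is a minimal sketch of what "the same operation in base R and in `dplyr`" can look like. It uses a small made-up data frame (`toy`, with hypothetical columns `id` and `length_mm`), not the virus data, and an operation (adding a computed column) that is not one of the exercises below.

```{r}
library(dplyr)

# Hypothetical toy data frame, not part of the lab dataset
toy <- data.frame(
  id = c("a", "b", "c"),
  length_mm = c(10, 20, 30)
)

# Base R: add a new column by assigning to it with `$`
toy_base <- toy
toy_base$length_cm <- toy_base$length_mm / 10

# dplyr: the same operation expressed with mutate()
toy_dplyr <- toy %>%
  mutate(length_cm = length_mm / 10)

# Check whether the two approaches produce the same data frame
all.equal(toy_base, toy_dplyr)
```

Notice that the `dplyr` version names the operation (`mutate()`) explicitly, while the base R version relies on indexing and assignment.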
a)
- Base R code:
```{r}
viruses[viruses$Family == "Coronaviridae", ]
```
- `dplyr` equivalent:
```{r}
```
b)
- Base R code:
```{r}
viruses[1:10, c(1, 2, 3, 17)]
```
Hint: Look at the `dplyr` function called `slice()` using `?slice()`.
- `dplyr` equivalent:
```{r}
```
c)
- Base R code:
```{r}
sort(viruses$Species[viruses$Genome == "(+)ssRNA"])
```
- `dplyr` equivalent:
```{r}
```
## Exercise 3: Code Annotation
In the following series of exercises, you will be provided with functioning R code of `dplyr` data manipulation pipelines. Your goal is to comment these code blocks line-by-line, describing what each function is doing to create the final output. Please note, if you're not sure how a given line is functioning within the whole code block, this type of code is easily run in successively larger chunks. In other words, start by running the first line, then the first two lines, then the first three lines, etc. in order to see how the output changes. Additionally, reviewing function help files (e.g., `?some_function()`) may shed light on what's happening.
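To illustrate the "successively larger chunks" approach, here is a small sketch on made-up data (the `toy` data frame and its columns are hypothetical, and the functions used are deliberately different from those in the exercises below). Each step re-runs the pipeline with one more function added, so you can see what that function contributes.

```{r}
library(dplyr)

# Hypothetical toy data, unrelated to the virus dataset
toy <- data.frame(
  site = c("north", "north", "south", "south", "south"),
  species = c("wren", "wren", "wren", "robin", "robin")
)

# Step 1: run only the first line to see the starting data
toy

# Step 2: add the next function and re-run to see what changed
step2 <- toy %>%
  distinct()
step2

# Step 3: add one more function and re-run again
step3 <- toy %>%
  distinct() %>%
  count(site)
step3
```

Comparing the printed output at each step is usually enough to work out what the newly added line is doing.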
a)
```{r}
viruses %>%
  mutate(Envelope_mod = ifelse(Envelope == 1, "enveloped", "not enveloped")) %>%
  filter(Discovery.year >= 2000) %>%
  select(Family, Species, Envelope_mod) %>%
  arrange(Family, Species)
```
b)
```{r}
viruses %>%
  group_by(Family) %>%
  summarize(
    n = n(),
    n_enveloped = sum(Envelope),
    proportion_enveloped = (n_enveloped/n)*100
  ) %>%
  arrange(desc(n))
```
What do you notice about the `proportion_enveloped` column?
c)
```{r}
viruses %>%
  group_by(Family) %>%
  summarize(n_genome_types = n_distinct(Genome)) %>%
  arrange(desc(n_genome_types))
```
What do you learn from this data summary about the number of distinct genome types per viral family?
## Bonus Exercise: Install `rethinking`
If you have not yet installed the `rethinking` package, now would be a good time to try to do so, using the instructions at https://github.com/rmcelreath/rethinking.