-
Notifications
You must be signed in to change notification settings - Fork 12
/
Copy path50-joining-tables.Rmd
453 lines (338 loc) · 15.1 KB
/
50-joining-tables.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
# Joining tables {#sec-join}
**Learning Objectives**
At the end of this section, students should understand
* the need and concept of table joins,
* differences between the different types of joins,
* the importance of keys in joins,
* circumstances leading to the appearance of missing values,
* the implications of using non-unique keys.
In many real life situations, data are spread across multiple tables
or spreadsheets. Usually this occurs because different types of
information about a subject, e.g. a patient, are collected from
different sources. It may be desirable for some analyses to combine
data from two or more tables into a single data frame based on a
common column, for example, an attribute that uniquely identifies the
subject.
The `dplyr` package, that we have already used extensively, provides a
set of join functions for combining two data frames based on matches
within specified columns.
For further reading, please refer to the chapter about [table
joins](https://r4ds.had.co.nz/relational-data.html#understanding-joins)
in @r4ds:2017.
```{r cleanup, echo = FALSE}
rm(list = ls())
```
The [Data Transformation Cheat
Sheet](https://github.com/rstudio/cheatsheets/raw/main/data-transformation.pdf)
also provides a short overview on table joins.
## Combining tables
We are going to illustrate join using a common example from the
bioinformatics world, where annotations about genes are scattered in
different tables that have one or more shared columns. The data we are
going to use are available in the course package and can be loaded as
shown below.
```{r joindata}
library("rWSBIM1207")
data(jdf)
```
The example data is composed of pairs of tables (we have tibbles here,
but this would equally work with dataframes). The first member of the
pair contains protein [UniProt](https://www.uniprot.org/)[^up] unique
accession number (`uniprot` variable), the most likely sub-cellular
localisation of these respective proteins (`organelle` variable) as
well as the proteins identifier (`entry`).
[^up]: UniProt is the protein information database. Its mission is to *provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information*.
```{r jdf1}
jdf1
```
The second table contains the name of the gene that codes for the
protein (`gene_name` variable), a description of the gene
(`description` variable), the uniprot accession number (this is the
common variable that can be used to join tables) and the species the
protein information comes from (`organism` variable).
```{r jdf2}
jdf2
```
We now want to join these two tables into a single one containing all
variables. We are going to use `dplyr`'s `full_join` function to do
so, that finds the common variable (in this case `uniprot`) to match
observations from the first and second table.
```{r join1}
library("dplyr")
full_join(jdf1, jdf2)
```
In the examples above, each observation of the `jdf1` and `jdf2`
tables are uniquely identified by their UniProt accession number. Such
variables are called **keys**. Keys are used to match observations
across different tables.
In case none of the variable names match, those to be used can be set
manually using the `by` argument, as shown below with the `jdf1` (as
above) and `jdf3` tables, where the UniProt accession number is
encoded using a different capitalisation.
```{r joinby}
names(jdf3)
full_join(jdf1, jdf3, by = c("uniprot" = "UniProt"))
```
As can be seen above, the variable name of the first table is retained
in the joined one.
`r msmbstyle::question_begin()`
Using the `full_join` function demonstrated above, join tables `jdf4`
and `jdf5`. What has happened for observations `P26039` and `P02468`?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r joinex1}
full_join(jdf4, jdf5)
```
`P02468` and `P02468` are only present in `jdf4` and `jdf5`
respectively, and their respective values for the variables of the
table have been encoded as missing.
`r msmbstyle::solution_end()`
## Different types of joins
Above, we have used the `full_join` function, that fully joins two
tables and keeps all observations, adding missing values if
necessary. Sometimes, we want to be selective, and keep observations
that are present in only one or both tables.
- An **inner join** keeps observations that are present in both
tables.
```{r, results='markup', fig.cap="An inner join matches pairs of observation matching in both tables, this dropping those that are unique to one table. Figure taken from *R for Data Science*.", echo=FALSE, purl=FALSE, out.width='70%', fig.align='center'}
knitr::include_graphics("./figs/join-inner.png")
```
- A **left join** keeps observations that are present in the left
(first) table, dropping those that are only present in the other.
- A **right join** keeps observations that are present in the right
(second) table, dropping those that are only present in the other.
- A **full join** keeps all observations.
```{r, results='markup', fig.cap="Outer joins match observations that appear in at least on table, filling up missing values with `NA` values. Figure taken from *R for Data Science*.", echo=FALSE, purl=FALSE, out.width='70%', fig.align='center'}
knitr::include_graphics("./figs/join-outer.png")
```
`r msmbstyle::question_begin()`
Join tables `jdf4` and `jdf5`, keeping only observations in `jdf4`.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r leftjoinex1}
left_join(jdf4, jdf5)
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Join tables `jdf4` and `jdf5`, keeping only observations in `jdf5`.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r rightjoinex1}
right_join(jdf4, jdf5)
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Join tables `jdf4` and `jdf5`, keeping observations observed in both tables.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r innerjoinex1}
inner_join(jdf4, jdf5)
```
`r msmbstyle::solution_end()`
## Multiple matches
Sometimes, keys aren't unique. In the `jdf6` table below, we see that
the accession number `Q99PL5` is repeated twice. According to this
table, the ribosomial protein binding protein 1 localises in the
[endoplasmic
reticulum](https://en.wikipedia.org/wiki/Endoplasmic_reticulum) (often
abbreviated ER) and in the [Golgi
apparatus](https://en.wikipedia.org/wiki/Golgi_apparatus) (often
abbreviated GA).
```{r jdf6}
jdf6
```
If we now want to join `jdf6` and `jdf2`, the variables of the latter
will be duplicated.
```{r multexple}
inner_join(jdf6, jdf2)
```
In the case above, repeating is useful, as it completes `jdf6` with
correct information from `jdf2`. One needs however to be careful when
duplicated keys exist in both tables. Below, we create an inner join
between `jdf6` and `jdf7`, both having duplicated `Q99PL5` entries.
```{r multproblem}
inner_join(jdf6, jdf7)
```
`r msmbstyle::question_begin()`
Interpret the result of the inner join above, where both tables have duplicated keys.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
`jdf6` has two entries, one for each possible sub-cellular
localisation of the protein. `jdf7` has also two entries, referring to
two different quantitative measurements (variable `measure`). When
joining the duplicated keys, you get all possible combinations.
```{r, results='markup', fig.cap="Joins with duplicated keys in both tables, producing all possible combinations. Figure taken from *R for Data Science*.", echo=FALSE, purl=FALSE, out.width='70%', fig.align='center'}
knitr::include_graphics("./figs/join-many-to-many.png")
```
In this case, we obtain wrong information: both proteins in the ER and
in the GA both have value 102 and 3.
`r msmbstyle::solution_end()`
## Matching across multiple keys
So far, we have matched tables using a single key (possibly with
different names in the two tables). Sometimes, it is necessary to
match tables using multiple keys. A typical example is when multiple
variables are needed to discriminate different rows in a tables.
Following up from the last example, we see that the duplicated UniProt
accession numbers in the `jdf6` and `jdf7` tables refer to different
[isoforms](https://en.wikipedia.org/wiki/Protein_isoform) of the same
RRBP1 gene. To uniquely identify isoforms, we need to consider two
keys, namely the UniProt accession number (named `uniprot` in both
tables) as well as the isoform number, called `isoform` and
`isoform_num` respectively.
Because the isoform status was encoded using different variable names
(which is, of course a source of confusion), `jdf6` and `jdf7` are
only automatically joined based on the shared `uniprot` key. Here, we
need to join using both keys and need to explicitly name the variables
used for the join.
```{r morekeys}
inner_join(jdf6, jdf7, by = c("uniprot" = "uniprot", "isoform" = "isoform_num"))
```
We now see that isoform 1 localised to the ER and has a measured value
of 102, while isoform 2, that localised to the GA, has a measured
value of 3.
`r msmbstyle::question_begin()`
Can you think of another way to merge tables `jdf6` and `jdf7` using
the two keys?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
Ideally, the isoform variables should be named identically in the two
tables, which would enable and automatic join with the two
keys. Below, we first fix the misnamed variable in `jdf7`. Instead of
updating the variable name by checking its index manually, we grep it
programmatically and store it in a new variable `i`. We can then join
the two tables without having to specify the two keys explicitly.
```{r morekeysex}
i <- grep("isoform", names(jdf7))
names(jdf7)[i] <- "isoform"
inner_join(jdf6, jdf7)
```
`r msmbstyle::solution_end()`
## Merge in base R
Above, we have used several join functions from the `dplyr` package as
they are convenient and easy to remember. The equivalent function in
the `base` package, that is installed with R, is the `merge`
function. The table below shows how these are related:
| dplyr | merge |
|--------------------|-------------------------------------------|
| `inner_join(x, y)` | `merge(x, y)` |
| `left_join(x, y)` | `merge(x, y, all.x = TRUE)` |
| `right_join(x, y)` | `merge(x, y, all.y = TRUE),` |
| `full_join(x, y)` | `merge(x, y, all.x = TRUE, all.y = TRUE)` |
Even if you decide to stick with one of these alternatives, it is
important to be aware of the other one, especially given the
widespread usage of `merge` in many packages and in R itself.
## Row and column binding
There are other two important functions in R, that can be used to
combine two dataframes, but assume that these already match
beforehand, as summerised in figure \@ref(fig:bindfig) below.
```{r bindfig, fig.width = 6, fig.asp = 1, fig.cap = "Matching dimension and names when binding by rows and columns.", echo = FALSE}
plot(1:10, type = "n", bty = "n", axes = FALSE, ann = FALSE,
xlim = c(1, 11), ylim = c(1, 11))
rect(1, 7.5, 5, 10)
segments(1, 9.55, 5, 9.55)
segments(1, 9.5, 5, 9.5)
text(1, 9.75, labels = paste(paste0("V", 1:5, collapse = " "), "..." , collapse = " "), pos = 4, offset = 0.1)
segments(seq(1, 5, 0.5), rep(7.5, 10),
seq(1, 5, 0.5), rep(9.55, 10))
rect(1, 1, 5, 7)
segments(1, 6.55, 5, 6.55)
text(1, 6.75, labels = paste(paste0("V", 1:5, collapse = " "), "..." , collapse = " "), pos = 4, offset = 0.1)
segments(seq(1, 5, 0.5), rep(1, 10),
seq(1, 5, 0.5), rep(6.55, 10))
segments(rep(1, 12), seq(1, 6.5, 0.5),
rep(5, 12), seq(1, 6.5, 0.5))
rect(6, 1, 11, 7)
segments(6, 6.55, 11, 6.55)
text(6, 6.75, labels = paste(paste0("X", 1:8, collapse = " "), "..." , collapse = " "), pos = 4, offset = 0.1)
segments(rep(6, 12), seq(1, 6.5, 0.5),
rep(11, 12), seq(1, 6.5, 0.5))
abline(v = seq(1, 5, 0.5), lty = "dotted", col = "grey")
abline(h = seq(1, 6.5, 0.5), lty = "dotted", col = "grey")
```
```{r, echo = FALSE}
d1 <- data.frame(x = 1:3, y = 1:3, row.names = paste0("r", 1:3))
d2 <- data.frame(a = 4:5, b = 4:5, row.names = paste0("r", 4:5))
d3 <- data.frame(v1 = 1:2, v2 = 3:4, v3 = 5:6, row.names = LETTERS[1:2])
```
We are going to illustrate binding by columns with dataframes `d1` and
`d2`, and then binding by rows using `d2` and `d3`. Lets start with
`d1` and `d2` shown below; both have the same number of columns but,
and this is crucial, do not have the same column names:
```{r}
d1
d2
```
While the number of columns match, the names don't, which results in an
error[^try] when we use `rbind`:
```{r}
try(rbind(d1, d2))
```
[^try]: The failing call to `rbind` is wrapped into a `try` call here
to stop the error from aborting the document compilation.
Before `rbind`ing two dataframes, we must assure that their number of
columns and colnames match exactly:
```{r}
names(d2) <- names(d1)
rbind(d1, d2)
```
If we want to bind to dataframes along their columns, we must make
sure that their number of rows match; neither rownames nor colnames
hinder here:
```{r}
d2
d3
cbind(d2, d3)
```
Note that beyond the dimensions and column names that are required to
match, the real meaning of `rbind` is to bind dataframes that contain
observations for the same set of variables - there is more than only
the column names. Below, we `rbind` dataframes with identical column
names but different variables, which end up all being coerced into
characters.
```{r}
d1
d4 <- data.frame(x = letters[1:2], y = letters[1:2])
str(rbind(d1, d4))
```
**Note**: `rbind` and `cbind` are base R functions. The *tidyverse*
alternatives from the `dplyr` package are `bind_rows` and `bind_cols`
and work similarly.
## Additional exercises
`r msmbstyle::question_begin()`
Using the `jdf4` and `jdf5` tables, emulate the left, right and inner
joins using a the full join and filter functions.
`r msmbstyle::question_end()`
`r msmbstyle::question_begin()`
Load the `rWSBIM1207` package. Using the `data` function, directly
load the `clinical2` and `expression` data into your global
environment.
- Inspect the clinical data. What kind of information do we have and
how many patients are recorded?
- Inspect the expression data. How many samples are recorded?
- Join the `expression` and `clinical2` tables by the patient
reference, using the `left_join` and the `right_join` functions. Why
are the results different?
- Join `expression` and `clinical2` tables in order to create a table
containing merged data exclusively for normal samples.
`r msmbstyle::question_end()`
```{r, echo=FALSE, include=FALSE}
library("rWSBIM1207")
data(expression)
data(clinical2)
nrow(clinical2)
nrow(expression)
## Join the `expression` and `clinical2` tables by the patient
## reference, using the `left_join` and the `right_join`
## functions. Why are the results different?
right_join(expression, clinical2, by = c("patient" = "patientID"))
left_join(expression, clinical2, by = c("patient" = "patientID"))
## Difference due to 3 patients recorded in clinical data are not
## recorded in expression.
length(clinical2$patientID)
length(unique(expression$patient))
## Join `expression` and `clinical2` tables in order to create a table
## containing merged data exclusively for normal samples.
expression %>%
filter(type == "normal") %>%
left_join(clinical2, by = c("patient" = "patientID"))
```