50-joining-tables.Rmd

# Joining tables {#sec-join}

**Learning Objectives**

At the end of this section, students should understand

* the need and concept of table joins,
* differences between the different types of joins,
* the importance of keys in joins,
* circumstances leading to the appearance of missing values,
* the implications of using non-unique keys.

In many real life situations, data are spread across multiple tables
or spreadsheets. Usually this occurs because different types of
information about a subject, e.g. a patient, are collected from
different sources. It may be desirable for some analyses to combine
data from two or more tables into a single data frame based on a
common column, for example, an attribute that uniquely identifies the
subject.

The `dplyr` package, that we have already used extensively, provides a
set of join functions for combining two data frames based on matches
within specified columns.

For further reading, please refer to the chapter about [table
joins](https://r4ds.had.co.nz/relational-data.html#understanding-joins)
in @r4ds:2017.

```{r cleanup, echo = FALSE}
rm(list = ls())
```

The [Data Transformation Cheat
Sheet](https://github.com/rstudio/cheatsheets/raw/main/data-transformation.pdf)
also provides a short overview on table joins.


## Combining tables

We are going to illustrate join using a common example from the
bioinformatics world, where annotations about genes are scattered in
different tables that have one or more shared columns. The data we are
going to use are available in the course package and can be loaded as
shown below.

```{r joindata}
library("rWSBIM1207")
data(jdf)
```

The example data is composed of pairs of tables (we have tibbles here,
but this would equally work with dataframes). The first member of the
pair contains protein [UniProt](https://www.uniprot.org/)[^up] unique
accession number (`uniprot` variable), the most likely sub-cellular
localisation of these respective proteins (`organelle` variable) as
well as the proteins identifier (`entry`).

[^up]: UniProt is the protein information database. Its mission is to *provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information*.

```{r jdf1}
jdf1
```

The second table contains the name of the gene that codes for the
protein (`gene_name` variable), a description of the gene
(`description` variable), the uniprot accession number (this is the
common variable that can be used to join tables) and the species the
protein information comes from (`organism` variable).

```{r jdf2}
jdf2
```

We now want to join these two tables into a single one containing all
variables. We are going to use `dplyr`'s `full_join` function to do
so, that finds the common variable (in this case `uniprot`) to match
observations from the first and second table.

```{r join1}
library("dplyr")
full_join(jdf1, jdf2)
```

In the examples above, each observation of the `jdf1` and `jdf2`
tables are uniquely identified by their UniProt accession number. Such
variables are called **keys**. Keys are used to match observations
across different tables.

In case none of the variable names match, those to be used can be set
manually using the `by` argument, as shown below with the `jdf1` (as
above) and `jdf3` tables, where the UniProt accession number is
encoded using a different capitalisation.

```{r joinby}
names(jdf3)
full_join(jdf1, jdf3, by = c("uniprot" = "UniProt"))
```

As can be seen above, the variable name of the first table is retained
in the joined one.

`r msmbstyle::question_begin()`
Using the `full_join` function demonstrated above, join tables `jdf4`
and `jdf5`. What has happened for observations `P26039` and `P02468`?
`r msmbstyle::question_end()`


`r msmbstyle::solution_begin()`
```{r joinex1}
full_join(jdf4, jdf5)
```

`P02468` and `P02468` are only present in `jdf4` and `jdf5`
respectively, and their respective values for the variables of the
table have been encoded as missing.

`r msmbstyle::solution_end()`

## Different types of joins

Above, we have used the `full_join` function, that fully joins two
tables and keeps all observations, adding missing values if
necessary. Sometimes, we want to be selective, and keep observations
that are present in only one or both tables.

- An **inner join** keeps observations that are present in both
  tables.

```{r, results='markup', fig.cap="An inner join matches pairs of observation matching in both tables, this dropping those that are unique to one table. Figure taken from *R for Data Science*.", echo=FALSE, purl=FALSE, out.width='70%', fig.align='center'}
knitr::include_graphics("./figs/join-inner.png")
```

- A **left join** keeps observations that are present in the left
  (first) table, dropping those that are only present in the other.
- A **right join** keeps observations that are present in the right
  (second) table, dropping those that are only present in the other.
- A **full join** keeps all observations.


```{r, results='markup', fig.cap="Outer joins match observations that appear in at least on table, filling up missing values with `NA` values. Figure taken from *R for Data Science*.", echo=FALSE, purl=FALSE, out.width='70%', fig.align='center'}
knitr::include_graphics("./figs/join-outer.png")
```

`r msmbstyle::question_begin()`
Join tables `jdf4` and `jdf5`, keeping only observations in `jdf4`.
`r msmbstyle::question_end()`

`r msmbstyle::solution_begin()`
```{r leftjoinex1}
left_join(jdf4, jdf5)
```
`r msmbstyle::solution_end()`

`r msmbstyle::question_begin()`
Join tables `jdf4` and `jdf5`, keeping only observations in `jdf5`.
`r msmbstyle::question_end()`

`r msmbstyle::solution_begin()`
```{r rightjoinex1}
right_join(jdf4, jdf5)
```
`r msmbstyle::solution_end()`

`r msmbstyle::question_begin()`
Join tables `jdf4` and `jdf5`, keeping observations observed in both tables.
`r msmbstyle::question_end()`

`r msmbstyle::solution_begin()`
```{r innerjoinex1}
inner_join(jdf4, jdf5)
```
`r msmbstyle::solution_end()`

## Multiple matches

Sometimes, keys aren't unique. In the `jdf6` table below, we see that
the accession number `Q99PL5` is repeated twice. According to this
table, the ribosomial protein binding protein 1 localises in the
[endoplasmic
reticulum](https://en.wikipedia.org/wiki/Endoplasmic_reticulum) (often
abbreviated ER) and in the [Golgi
apparatus](https://en.wikipedia.org/wiki/Golgi_apparatus) (often
abbreviated GA).

```{r jdf6}
jdf6
```

If we now want to join `jdf6` and `jdf2`, the variables of the latter
will be duplicated.

```{r multexple}
inner_join(jdf6, jdf2)
```

In the case above, repeating is useful, as it completes `jdf6` with
correct information from `jdf2`. One needs however to be careful when
duplicated keys exist in both tables. Below, we create an inner join
between `jdf6` and `jdf7`, both having duplicated `Q99PL5` entries.

```{r multproblem}
inner_join(jdf6, jdf7)
```

`r msmbstyle::question_begin()`
Interpret the result of the inner join above, where both tables have duplicated keys.
`r msmbstyle::question_end()`


`r msmbstyle::solution_begin()`
`jdf6` has two entries, one for each possible sub-cellular
localisation of the protein. `jdf7` has also two entries, referring to
two different quantitative measurements (variable `measure`). When
joining the duplicated keys, you get all possible combinations.

```{r, results='markup', fig.cap="Joins with duplicated keys in both tables, producing all possible combinations. Figure taken from *R for Data Science*.", echo=FALSE, purl=FALSE, out.width='70%', fig.align='center'}
knitr::include_graphics("./figs/join-many-to-many.png")
```

In this case, we obtain wrong information: both proteins in the ER and
in the GA both have value 102 and 3.
`r msmbstyle::solution_end()`


## Matching across multiple keys

So far, we have matched tables using a single key (possibly with
different names in the two tables). Sometimes, it is necessary to
match tables using multiple keys. A typical example is when multiple
variables are needed to discriminate different rows in a tables.

Following up from the last example, we see that the duplicated UniProt
accession numbers in the `jdf6` and `jdf7` tables refer to different
[isoforms](https://en.wikipedia.org/wiki/Protein_isoform) of the same
RRBP1 gene. To uniquely identify isoforms, we need to consider two
keys, namely the UniProt accession number (named `uniprot` in both
tables) as well as the isoform number, called `isoform` and
`isoform_num` respectively.

Because the isoform status was encoded using different variable names
(which is, of course a source of confusion), `jdf6` and `jdf7` are
only automatically joined based on the shared `uniprot` key. Here, we
need to join using both keys and need to explicitly name the variables
used for the join.


```{r morekeys}
inner_join(jdf6, jdf7, by = c("uniprot" = "uniprot", "isoform" = "isoform_num"))
```

We now see that isoform 1 localised to the ER and has a measured value
of 102, while isoform 2, that localised to the GA, has a measured
value of 3.


`r msmbstyle::question_begin()`
Can you think of another way to merge tables `jdf6` and `jdf7` using
the two keys?
`r msmbstyle::question_end()`


`r msmbstyle::solution_begin()`
Ideally, the isoform variables should be named identically in the two
tables, which would enable and automatic join with the two
keys. Below, we first fix the misnamed variable in `jdf7`. Instead of
updating the variable name by checking its index manually, we grep it
programmatically and store it in a new variable `i`. We can then join
the two tables without having to specify the two keys explicitly.

```{r morekeysex}
i <- grep("isoform", names(jdf7))
names(jdf7)[i] <- "isoform"
inner_join(jdf6, jdf7)
```


`r msmbstyle::solution_end()`


## Merge in base R

Above, we have used several join functions from the `dplyr` package as
they are convenient and easy to remember. The equivalent function in
the `base` package, that is installed with R, is the `merge`
function. The table below shows how these are related:

| dplyr              | merge                                     |
|--------------------|-------------------------------------------|
| `inner_join(x, y)` | `merge(x, y)`                             |
| `left_join(x, y)`  | `merge(x, y, all.x = TRUE)`               |
| `right_join(x, y)` | `merge(x, y, all.y = TRUE),`              |
| `full_join(x, y)`  | `merge(x, y, all.x = TRUE, all.y = TRUE)` |


Even if you decide to stick with one of these alternatives, it is
important to be aware of the other one, especially given the
widespread usage of `merge` in many packages and in R itself.

## Row and column binding

There are other two important functions in R, that can be used to
combine two dataframes, but assume that these already match
beforehand, as summerised in figure \@ref(fig:bindfig) below.

```{r bindfig, fig.width = 6, fig.asp = 1, fig.cap = "Matching dimension and names when binding by rows and columns.", echo = FALSE}
plot(1:10, type = "n", bty = "n", axes = FALSE, ann = FALSE,
     xlim = c(1, 11), ylim = c(1, 11))

rect(1, 7.5, 5, 10)
segments(1, 9.55, 5, 9.55)
segments(1, 9.5, 5, 9.5)
text(1, 9.75, labels = paste(paste0("V", 1:5, collapse = " "), "..." , collapse = " "), pos = 4, offset = 0.1)
segments(seq(1, 5, 0.5), rep(7.5, 10),
         seq(1, 5, 0.5), rep(9.55, 10))

rect(1, 1, 5, 7)
segments(1, 6.55, 5, 6.55)
text(1, 6.75, labels = paste(paste0("V", 1:5, collapse = " "), "..." , collapse = " "), pos = 4, offset = 0.1)
segments(seq(1, 5, 0.5), rep(1, 10),
         seq(1, 5, 0.5), rep(6.55, 10))
segments(rep(1, 12), seq(1, 6.5, 0.5),
         rep(5, 12), seq(1, 6.5, 0.5))

rect(6, 1, 11, 7)
segments(6, 6.55, 11, 6.55)
text(6, 6.75, labels = paste(paste0("X", 1:8, collapse = " "), "..." , collapse = " "), pos = 4, offset = 0.1)
segments(rep(6, 12), seq(1, 6.5, 0.5),
         rep(11, 12), seq(1, 6.5, 0.5))

abline(v = seq(1, 5, 0.5), lty = "dotted", col = "grey")
abline(h = seq(1, 6.5, 0.5), lty = "dotted", col = "grey")
```

```{r, echo = FALSE}
d1 <- data.frame(x = 1:3, y = 1:3, row.names = paste0("r", 1:3))
d2 <- data.frame(a = 4:5, b = 4:5, row.names = paste0("r", 4:5))
d3 <- data.frame(v1 = 1:2, v2 = 3:4, v3 = 5:6, row.names = LETTERS[1:2])
```

We are going to illustrate binding by columns with dataframes `d1` and
`d2`, and then binding by rows using `d2` and `d3`. Lets start with
`d1` and `d2` shown below; both have the same number of columns but,
and this is crucial, do not have the same column names:

```{r}
d1
d2
```

While the number of columns match, the names don't, which results in an
error[^try] when we use `rbind`:

```{r}
try(rbind(d1, d2))
```

[^try]: The failing call to `rbind` is wrapped into a `try` call here
    to stop the error from aborting the document compilation.

Before `rbind`ing two dataframes, we must assure that their number of
columns and colnames match exactly:

```{r}
names(d2) <- names(d1)
rbind(d1, d2)
```

If we want to bind to dataframes along their columns, we must make
sure that their number of rows match; neither rownames nor colnames
hinder here:

```{r}
d2
d3
cbind(d2, d3)
```

Note that beyond the dimensions and column names that are required to
match, the real meaning of `rbind` is to bind dataframes that contain
observations for the same set of variables - there is more than only
the column names. Below, we `rbind` dataframes with identical column
names but different variables, which end up all being coerced into
characters.


```{r}
d1
d4 <- data.frame(x = letters[1:2], y = letters[1:2])
str(rbind(d1, d4))
```


**Note**: `rbind` and `cbind` are base R functions. The *tidyverse*
alternatives from the `dplyr` package are `bind_rows` and `bind_cols`
and work similarly.

## Additional exercises

`r msmbstyle::question_begin()`

Using the `jdf4` and `jdf5` tables, emulate the left, right and inner
joins using a the full join and filter functions.

`r msmbstyle::question_end()`


`r msmbstyle::question_begin()`

Load the `rWSBIM1207` package.  Using the `data` function, directly
load the `clinical2` and `expression` data into your global
environment.

- Inspect the clinical data. What kind of information do we have and
  how many patients are recorded?

- Inspect the expression data. How many samples are recorded?

- Join the `expression` and `clinical2` tables by the patient
  reference, using the `left_join` and the `right_join` functions. Why
  are the results different?

- Join `expression` and `clinical2` tables in order to create a table
  containing merged data exclusively for normal samples.

`r msmbstyle::question_end()`


```{r, echo=FALSE, include=FALSE}
library("rWSBIM1207")
data(expression)
data(clinical2)

nrow(clinical2)
nrow(expression)

## Join the `expression` and `clinical2` tables by the patient
## reference, using the `left_join` and the `right_join`
## functions. Why are the results different?
right_join(expression, clinical2, by = c("patient" = "patientID"))
left_join(expression, clinical2, by = c("patient" = "patientID"))

## Difference due to 3 patients recorded in clinical data are not
## recorded in expression.
length(clinical2$patientID)
length(unique(expression$patient))

## Join `expression` and `clinical2` tables in order to create a table
## containing merged data exclusively for normal samples.
expression %>%
  filter(type == "normal") %>%
  left_join(clinical2, by = c("patient" = "patientID"))
```