-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathexample_analysis.Rmd
121 lines (92 loc) · 5.9 KB
/
example_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
---
title: "Data creation and analysis examples"
date: "`r Sys.Date()`"
output:
md_document:
variant: gfm
toc: yes
html_notebook:
toc: yes
---
The below code loads common basis functions from [src/common_basis.R](src/common_basis.R) (executing any other cell in the notebook runs this code automatically first if it has not already been run). The template assumes that functions useful for most work be defined in `common_basis.R`, whereas code useful for individual analyses is defined where needed.
Naturally, if more refined organisation of common code is needed, one is also free to define whichever other modules one wants.
The central object defined in `common_basis.R` is `con`, which is the [MariaDB](https://mariadb.com/) (MySQL) database connection (a [DBI](https://dbi.r-dbi.org/) connection) through which both ready data is accessed, as well as new data stored for others to reuse. Below, you will see both how to use con to store data in the database, as well as how query it.
The details of the database connection are stored in [`db_params.yaml`](/db_params.yaml). The password is given separately. **DO NOT INCLUDE THE PASSWORD IN ANY CODE YOU COMMIT TO GITHUB**. If running this notebook, it will ask for the password the first time you run it, and then store it separately in your keyring. This requires a working keyring implementation on your system. Consult the [`keyring`](https://r-lib.github.io/keyring/index.html) package documentation if you have problems. If you cannot get it to work, a second option is to create a `db_secret.yaml` file in the project root directory with `db_pass: [PASSWORD]` as the content. This file is already set to be ignored by Git so it wouldn't accidentally get included in a commit, but still, if you do this, **DON'T MAKE THE MISTAKE OF COMMITTING THE FILE TO GITHUB**.
```{r setup}
source(here::here("src/common_basis.R"))
```
In general, I suggest sticking to [`tidyverse`](https://www.tidyverse.org/) when working with this data (and any data in general). To get to know R `tidyverse`, here are some great tutorials:
* https://moderndive.com/index.html chapters 1-4 are the best intro into R and tidyverse that I know of.
* Then, if you're interested in expanding your visual analysis capabilities, following up with https://socviz.co/ is good.
* Excellent cheatsheets to keep on hand or even print while learning: https://www.rstudio.com/resources/cheatsheets/
* And finally “R for Data Science” is a central work by the authors of the packages, but I don’t think as good as the above: https://r4ds.had.co.nz/. Can be used to supplement the modern dive if something remains unclear or needs deepening after that.
# Storing data in the database
```{r storing_data_in_the_database}
# create some sample data
d <- tibble(
year=c(1700, 1710, 1710, 1713),
type=c('book', 'pamphlet', 'pamphlet', 'pamphlet'),
genre=c('drama', 'literature', 'science', 'drama')
)
# this copies the sample data to the database. con is the connection object that is defined in common_basis. Additionally for efficiency reasons, we want the created table to be an Aria table, which the custom copy_to_a function (https://hsci-r.github.io/hscidbutil/reference/copy_to_a.html) offers.
test_a <- d %>%
copy_to_a(con, name="test_a", temporary=FALSE, overwrite=TRUE, indexes=c("year","type","genre"))
#Here, we also create a ColumnStore (https://mariadb.com/kb/en/mariadb-columnstore/) version of the table. ColumnStore tables are very good for large aggregate queries, but support for them is still a bit quirky, so some queries may just not work. They're also slower for queries that mainly seek and return individual full rows of data. Further, it is better not to mix ColumnStore and other tables in queries.
test_c <- test_a %>%
compute_c(name="test_c", temporary=FALSE, overwrite=TRUE)
```
In the above, `copy_to` copies local data into the database, while `compute` takes the result of a [`dbplyr`](https://dbplyr.tidyverse.org/articles/dbplyr.html) pipeline and stores it in a table. Here, the pipeline is a s simple as taking everything in `test_a`.
# Data analysis
By default, `dbplyr` tables are lazy and keep the data in the database. However, you can use them as normal, and they pull in the data they need, when they need:
```{r}
test_a
```
Because `dbplyr` is able to transform many [`tidyverse`](https://www.tidyverse.org/) operations transparently into SQL queries, one can also compute more complex results just using tidyverse:
```{r}
test_a %>%
group_by(year) %>%
summarise(books=n(),.groups="drop")
```
If you need to explicitly copy data to a local tibble, you can use `collect()` (this can happen for example when some function you want to use on the data doesn't exist as a `dbplyr` translation).
```{r}
test_a %>%
collect()
```
However, often one can just use the remote tables as one would use local tibbles. This ensures for example that only the aggregated data that needed for a final visualisation is pulled locally, which is crucial for large datasets which may not even fit wholly in memory:
```{r example_analysis, fig.width=5, fig.height=5}
test_c %>%
group_by(year,genre,type) %>%
summarize(books=n(), .groups="drop") %>%
ggplot(aes(x=year,y=books,fill=type)) +
scale_y_continuous(breaks=c(0,3)) +
scale_x_continuous(breaks=seq(1700,1800,by=2)) +
geom_col()
```
If you ever need to see the queries dbplyr creates, you can use `show_query()` or `explain()` for that:
```{r}
test_c %>%
group_by(year,genre,type) %>%
summarize(books=n(), .groups="drop") %>%
show_query()
test_a %>%
group_by(year,genre,type) %>%
summarize(books=n(), .groups="drop") %>%
explain()
```
You can also run arbitrary SQL queries as follows:
```{r}
tbl(con, sql("
SELECT year, COUNT(*) AS books
FROM test_c
GROUP BY year
"))
```
And you can even pass these on to further tidyverse functions:
```{r}
tbl(con, sql("
SELECT year, COUNT(*) AS books
FROM test_c
GROUP BY year
")) %>%
arrange(year)
```