Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance and optimization vignette #277

Merged
merged 5 commits into from
Sep 19, 2024
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ Suggests:
covr
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.3.1
RoxygenNote: 7.3.2
Config/testthat/edition: 3
Depends:
R (>= 2.10)
5 changes: 5 additions & 0 deletions R/coro.R
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,11 @@ the <- new.env()
#' A generator for making request to OpenAlex API
#' Returns one record at a time.
#'
#' @section Details:
#' For the case without group_by, setting `verbose = TRUE` when you call `oar()`
#' for the first time will give the total number of records in the message.
#' With group_by, unfortunately we don't have a way to know the total until we query until exhausted.
#'
#' @param ... arguments passed to the generator including
#' `query_url`, `mailto`, `api_key`, and `verbose`.
#' See `oa_request` for details on these arguments.
Expand Down
7 changes: 7 additions & 0 deletions man/oa_generate.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

8 changes: 6 additions & 2 deletions tests/testthat/test-oa_fetch.R
Original file line number Diff line number Diff line change
Expand Up @@ -50,11 +50,15 @@ test_that("oa_fetch works", {
expect_true("au_affiliation_raw" %in% names(multi_works$author[[1]]))

Sys.sleep(1 / 10)
expect_warning(filtered_works <- oa_fetch(
# warn about truncated authors
expect_warning(oa_fetch(identifier = c("W4381194940", "W4386241859")))

Sys.sleep(1 / 10)
filtered_works <- oa_fetch(
entity = "w",
publication_date = "2020-08-01",
cited_by_count = ">1000"
))
)
expect_s3_class(filtered_works, "data.frame")

expect_warning(oa_fetch(doi = "123"))
Expand Down
4 changes: 3 additions & 1 deletion tests/testthat/test-oa_ngrams.R
Original file line number Diff line number Diff line change
@@ -1,8 +1,10 @@
test_that("oa_ngrams works", {
skip(message = "Skipping") # TODO to remove

skip_on_cran()

# Single paper ngram search ####
works_multi <- c("W1963991285")
works_multi <- c("W2317271409")
ngram_multi <- oa_ngrams(works_multi)

expect_equal(nrow(ngram_multi), length(works_multi))
Expand Down
Binary file added vignettes/articles/images/openalex-web.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
178 changes: 178 additions & 0 deletions vignettes/articles/performance-optimization.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,178 @@
---
title: "Performance and optimization"
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

While `oa_fetch()` offers a convenient and flexible way of retrieving results from queries to the OpenAlex API, its defaults may not be best suited for heavier workflows that involve fetching records in the magnitude of tens or hundreds of thousands of entities.

Optimizing the performance of such large queries benefits greatly from being intentional and specific about what kinds of information you care about, and making assumptions that let you safely take shortcuts around the defaults.

This vignette discusses three strategies for for optimizing performance of large queries:

- The parameter `options = list(select = ...)` in `oa_fetch()`
- The argument `output = "list"` in `oa_fetch()`
- The function `oa_generate()`

```{r setup, message=FALSE}
library(openalexR)
library(dplyr)
```

## The `select` strategy

The `options` argument of `oa_fetch()` specifies a list of additional parameters to add to the query, such as `select`, `sort`, and `sample`.
Of these, `select` can be used to specify which fields of the entities are to be returned by OpenAlex.
By specifying only the kinds of information about entities that you care about, you can reduce the overall size of the query result, which will in turn speed up the processing of the raw JSON and converting of the JSON into a data frame.

For example, suppose that we are looking for a sample of works from the [Topic](https://docs.openalex.org/api-entities/topics) of [Language Development and Acquisition in Children](https://openalex.org/topics/T10730) (`"T10730"`).

```{r}
language_development <- oa_fetch(
entity = "topics",
search = "Language Development and Acquisition in Children"
)[1,1:2]
language_development
```

To sample some papers from this topic, we can use the `topics.id` [filter](https://docs.ropensci.org/openalexR/articles/Filters.html) and set `options = list(sample = 5, seed = 1)` to return a random set of five [Works](https://docs.openalex.org/api-entities/works) entities with a reproducible seed:

```{r}
oa_fetch(
entity = "works",
topics.id = language_development$id,
options = list(sample = 5, seed = 1)
) %>%
show_works()
```

In OpenAlex, entities have a set of **fields** which represent various information about them.
These are typically returned as data frame columns by `oa_fetch()`, and the full list of fields can be found in the API documentation for each entity.
For example, [the fields in a Works object](https://docs.openalex.org/api-entities/works/work-object) contain information such as `id`, `display_name`, `authorships`, and so on.

If we only cared about the above three fields from our sample of papers, we can simplify specify those fields in the `select` parameters of the `options` list of arguments:

```{r}
oa_fetch(
entity = "works",
topics.id = language_development$id,
options = list(sample = 5, seed = 1,
select = c("id", "display_name", "authorships"))
)
```

This returns the scalar fields `id` and `display_name` in their appropriate data types (character) in the dataframe.
Additionally, the `authorships` field has been further processed as a list-column of data frames, to fit nicely into the "tidy" data frame structure.

Specifying the desired fields up front in this way is not only convenient but also more performant, as there will be less data for `oa_fetch()` to process.

## The `output = "list"` strategy

By default, `oa_fetch()` uses `output = "tibble"`, which returns a processed `tibble` data frame of the results.
In such cases, the JSON response from OpenAlex is first converted to an R list, then a data frame via `oa2df()`, which calls the appropriate conversion implementation depending on the type of entity being processed (e.g., `works2df()` for Works entities).

A lot of care goes into `oa2df()` to return a compact, tidy-data representation of query results.
But these operations can become a bottleneck to performance at scale, and so sometimes you may want to opt out of this automatic data frame conversion.

To do so in `oa_fetch()`, you can set `output = "list"`, which will simply return the R list corresponding to the JSON response.

```{r}
output_list <- oa_fetch(
entity = "works",
topics.id = language_development$id,
options = list(sample = 5, seed = 1),
output = "list"
)
str(output_list, max.level = 1)
```

The list output is quite unruly - each entity returns dozens of fields, some of which may be multiply nested.
Moreover, some entities have missing fields entirely.

One advantage of returning the output as a list is that you can always come back to process them as data frames later.
Instead of retrieving *and* converting the results simultaneously, which may stress `oa_fetch()` for large queries, you can retrieve all the results first *and then* convert them after the fact.

In our case, the Works entities can be processed with `works2df()` (or more generally, `oa2df(entity = "works")`), which returns a data frame identical to what we saw at the start with the default `output = "tibble"`:

```{r}
works2df(output_list) %>%
show_works()
```

Lastly, opting out of the data frame conversion means that you can use your own implementation for converting the list output.
This can be a very powerful optimization strategy when combined with the `select` option.
For example, if you know that you are only selecting scalar fields, you can very quickly convert the list output into a tidy data frame using more powerful tools like `data.table::rbindlist()` or even just `rbind()`:

```{r}
oa_fetch(
entity = "works",
topics.id = language_development$id,
options = list(sample = 5, seed = 1,
select = c("id", "display_name", "cited_by_count")),
output = "list"
) %>%
do.call(rbind.data.frame, .) %>%
as_tibble()
```

## The `oa_generate()` strategy

If your code still seems slow, it is possible you run out of memory (especially when you do a snowball search like with `oa_snowball`).
So it might help to chunk your work and save the output of each step, then piece them back together later in a different session/program.[^1]

[^1]: Determining whether memory is sufficient to store all records at the beginning is not straightforward.
We believe the (power) user should be in charge of this step, combining \`count_only = TRUE\` with what their knowledge of their machine's RAM specs to determine whether they should write out the stepwise results themselves.
Related GH issues: [here](https://github.com/ropensci/openalexR/issues/129#issuecomment-1642664363) and [here](https://github.com/ropensci/openalexR/issues/95#issuecomment-1513964112).

The `oa_generate()` function is a lower-level function that allows you to process one record at a time.
This way, you can process records in batches of, say, 1000 records, and write them out to disk as you go along.[^2]

[^2]: Quick note: For the case without `group_by`, setting `verbose = TRUE` when you call `oar()` for the first time will give the total number of records in the message.
With `group_by`, unfortunately we don't have a way to know the total until we query until exhausted.
Also, you will need to have the R package [**coro**](https://github.com/r-lib/coro) installed to use `oa_generate()` (`install.packages("coro")`).

In the example below, we show how `oa_generate()` works when we want to find all the works that cite [W1160808132](https://openalex.org/works/W1160808132).

```{r}
query_url <- "https://api.openalex.org/works?filter=cites%3AW1160808132"
trangdata marked this conversation as resolved.
Show resolved Hide resolved
oar <- oa_generate(query_url, verbose = TRUE)
p1 <- oar() # record 1
p2 <- oar() # record 2
p3 <- oar() # record 3
head(p1)
head(p3)
```

As you see, each record returned by `oa_generate` is a list of fields belonging to a work, parsed from the JSON response from OpenAlex.
You can process these records as you see fit, such as writing them out as *.rds* files in batches of 100 records.

```{r eval=FALSE}
query_url <- "https://api.openalex.org/works?filter=cites%3AW1160808132"
oar <- oa_generate(query_url)
n <- 100
recs <- vector("list", n)
i <- 0

coro::loop(for (x in oar) {
j <- i %% n + 1
recs[[j]] <- x
if (j == n) {
saveRDS(recs, sprintf("rec-%s.rds", i %/% n))
recs <- vector("list", n) # reset recs
}
i <- i + 1
})
```

### Tips on generating the query URL to the OpenAlex API

You can certainly build your query using `oa_query()` and carefully read the [API documentation](https://docs.openalex.org/) to see what fields/filters are available.
However, I sometimes find it super helpful to use the OpenAlex web interface to build the query [interactively](https://openalex.org/works?page=1&filter=authorships.author.id%3Aa5086928770,open_access.is_oa%3Atrue&view=list,report,api).
Make sure you select the Gear icon on the right and toggle on "Api query".

![](images/openalex-web.png)
Loading