Refactor and optimization #132

Merged: 11 commits merged into main on Oct 26, 2023

Conversation

trangdata commented Jul 19, 2023

This will need more extensive testing...

Some cleanup and optimization so far:

  • remove the use of simple_rapply
  • add options as a new argument to oa_snowball

Related: #129

@trangdata

Re #127: users can now set the environment variable openalexR.print to the maximum number of characters to print for the request URL, which truncates very long URLs in the verbose output:

library(openalexR)

w <- function() {
  oa_fetch(
    entity = "works",
    title.search = c("bibliometric analysis", "science mapping"),
    cited_by_count = ">50",
    options = list(select = "id"),
    from_publication_date = "2021-01-01",
    to_publication_date = "2021-12-31",
    verbose = TRUE
  )
}

w0 <- w()
#> Requesting url: https://api.openalex.org/works?filter=title.search%3Abibliometric%20analysis%7Cscience%20mapping%2Ccited_by_count%3A%3E50%2Cfrom_publication_date%3A2021-01-01%2Cto_publication_date%3A2021-12-31&select=id
#> Getting 1 page of results with a total of 63 records...
Sys.setenv(openalexR.print = 70)
w1 <- w()
#> Requesting url: https://api.openalex.org/works?filter=title.search%3Abibliometric%20an...
#> Getting 1 page of results with a total of 63 records...
Sys.unsetenv("openalexR.print")
w2 <- w()
#> Requesting url: https://api.openalex.org/works?filter=title.search%3Abibliometric%20analysis%7Cscience%20mapping%2Ccited_by_count%3A%3E50%2Cfrom_publication_date%3A2021-01-01%2Cto_publication_date%3A2021-12-31&select=id
#> Getting 1 page of results with a total of 63 records...

Created on 2023-10-24 with reprex v2.0.2
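
A minimal sketch of the truncation behavior shown in the output above, assuming the URL is cut at the requested number of characters and "..." is appended (this matches the 70-character line in the reprex). `shorten_url` is a hypothetical helper written for illustration; it is not the function openalexR uses internally.

```r
# Hypothetical helper for illustration; not openalexR's internal code.
# Reads the width from the openalexR.print environment variable unless given.
shorten_url <- function(url, width = Sys.getenv("openalexR.print", unset = NA)) {
  width <- suppressWarnings(as.integer(width))
  # Unset or non-numeric value, or a URL that already fits: print it in full.
  if (is.na(width) || nchar(url) <= width) {
    return(url)
  }
  paste0(substr(url, 1, width), "...")
}

url <- paste0(
  "https://api.openalex.org/works?filter=title.search%3Abibliometric%20analysis",
  "%7Cscience%20mapping%2Ccited_by_count%3A%3E50"
)

shorten_url(url, width = 70)
#> [1] "https://api.openalex.org/works?filter=title.search%3Abibliometric%20an..."
shorten_url(url, width = NA) # same as leaving the variable unset: full URL
```

Only the printed message is shortened in this sketch; the request itself would still use the full URL.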

trangdata commented Oct 24, 2023

Re #129: Previously, oa_snowball could take a long time. This refactor removes the use of simple_rapply and improves speed.

Previously:

library(openalexR)
packageVersion("openalexR")
#> [1] '1.2.2'
w <- oa_fetch("works", options = list(sample = 20, seed = 1, select = "id"))
myids <- openalexR:::shorten_oaid(w$id)
system.time({
  ilk_snowball <- oa_snowball(
    identifier = myids,
    verbose = TRUE
  )
})
#> Requesting url: https://api.openalex.org/works?filter=openalex%3AW2752822653%7CW2057540892%7CW2071641039%7CW2528237503%7CW4255644834%7CW2039776320%7CW1998173837%7CW2894916677%7CW4205808956%7CW4292916519%7CW2210922255%7CW2123690481%7CW2074469351%7CW4378553964%7CW2321856033%7CW2439084087%7CW2294799430%7CW2966056779%7CW1424334985%7CW2425037722
#> Getting 1 page of results with a total of 20 records...
#> Collecting all documents citing the target papers...
#> Requesting url: https://api.openalex.org/works?filter=cites%3AW2071641039%7CW1998173837%7CW2057540892%7CW2321856033%7CW4205808956%7CW2210922255%7CW2294799430%7CW2425037722%7CW2123690481%7CW1424334985%7CW2039776320%7CW2074469351%7CW2439084087%7CW2528237503%7CW2752822653%7CW2894916677%7CW2966056779%7CW4255644834%7CW4292916519%7CW4378553964
#> Getting 2 pages of results with a total of 324 records...
#> Collecting all documents cited by the target papers...
#> Requesting url: https://api.openalex.org/works?filter=cited_by%3AW2071641039%7CW1998173837%7CW2057540892%7CW2321856033%7CW4205808956%7CW2210922255%7CW2294799430%7CW2425037722%7CW2123690481%7CW1424334985%7CW2039776320%7CW2074469351%7CW2439084087%7CW2528237503%7CW2752822653%7CW2894916677%7CW2966056779%7CW4255644834%7CW4292916519%7CW4378553964
#> Getting 1 page of results with a total of 135 records...
#>    user  system elapsed 
#>   3.672   0.060  11.042

Now:

library(openalexR)
packageVersion("openalexR")
#> [1] '1.2.2.9999'
Sys.setenv(openalexR.print = 70)
w <- oa_fetch("works", options = list(sample = 20, seed = 1, select = "id"))
myids <- openalexR:::shorten_oaid(w$id)
system.time({
  ilk_snowball <- oa_snowball(
    identifier = myids,
    verbose = TRUE
  )
})
#> Requesting url: https://api.openalex.org/works?filter=openalex%3AW2752822653%7CW205754...
#> Getting 1 page of results with a total of 20 records...
#> Collecting all documents citing the target papers...
#> Requesting url: https://api.openalex.org/works?filter=cites%3AW2071641039%7CW199817383...
#> Getting 2 pages of results with a total of 324 records...
#> Collecting all documents cited by the target papers...
#> Requesting url: https://api.openalex.org/works?filter=cited_by%3AW2071641039%7CW199817...
#> Getting 1 page of results with a total of 135 records...
#>    user  system elapsed 
#>   2.089   0.049   4.103

We can also make it a little faster by specifying the fields we want in oa_snowball with options = list(select = c("id", "display_name", "authorships", "referenced_works")). Note that the newest implementation allows different options for the core papers, the citing papers, and the cited-by papers, so these options must be specified separately, like so:

library(openalexR)
packageVersion("openalexR")
#> [1] '1.2.2.9999'
Sys.setenv(openalexR.print = 70)
w <- oa_fetch("works", options = list(sample = 20, seed = 1, select = "id"))
myids <- openalexR:::shorten_oaid(w$id)
my_opts <- list(select = c("id", "display_name", "authorships", "referenced_works"))
system.time({
  ilk_snowball <- oa_snowball(
    identifier = myids,
    options = my_opts,
    citing_params = list(options = my_opts),
    cited_by_params = list(options = my_opts),
    verbose = TRUE
  )
})
#> Requesting url: https://api.openalex.org/works?filter=openalex%3AW2752822653%7CW205754...
#> Getting 1 page of results with a total of 20 records...
#> Collecting all documents citing the target papers...
#> Requesting url: https://api.openalex.org/works?filter=cites%3AW2071641039%7CW199817383...
#> Getting 2 pages of results with a total of 324 records...
#> Collecting all documents cited by the target papers...
#> Requesting url: https://api.openalex.org/works?filter=cited_by%3AW2071641039%7CW199817...
#> Getting 1 page of results with a total of 135 records...
#>    user  system elapsed 
#>   0.898   0.016   2.075

Created on 2023-10-24 with reprex v2.0.2
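
To put the three timings side by side, the elapsed times reported above can be turned into speedup factors. Each figure comes from a single run that includes network latency, so the ratios are indicative only; the vector names are labels chosen for this sketch.

```r
# Elapsed seconds copied from the three system.time() outputs above.
elapsed <- c(
  before = 11.042,
  refactored = 4.103,
  refactored_with_select = 2.075
)

# Speedup of each run relative to the 1.2.2 run.
speedup <- round(elapsed[["before"]] / elapsed, 2)
speedup
```

On these numbers the refactor alone is roughly 2.7 times faster, and roughly 5.3 times faster once the selected fields are restricted.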

trangdata marked this pull request as ready for review on October 24, 2023, 14:17

rkrug commented Oct 24, 2023

The specification of the fields seems to make a huge difference. Great.

yjunechoe left a comment

LGTM! Left two small suggestions (resolve as you wish) for printing the shortened URL when the option is set.

Also, this is unrelated, but when I ran devtools::test() on this PR, one test threw a warning:

test_that("oa_request returns list", {
  skip_on_cran()
  query_url <- paste0(
    "https://api.openalex.org/authors?",
    "filter=openalex%3AA923435168%7CA2208157607"
  )
  expect_type(oa_request(query_url), "list")
})

#> Warning message:
#> In oa_request(query_url) : No records found!

Co-authored-by: June Choe

trangdata commented Oct 26, 2023

> one test threw a warning

Oh man, did OpenAlex change its author IDs again? I'll check. All tests ran fine two days ago, so I'm not sure why A2208157607 and A923435168 are no longer valid author IDs. Hmm... I think what happened is that I wasn't thorough enough when updating the author IDs in #167. Will update these IDs now.

trangdata merged commit 056e55c into main on Oct 26, 2023 (9 checks passed)
trangdata deleted the oa-snowball branch on October 26, 2023, 13:39