Using duckDB via json to create tibble from OpenAlex return values #275
I would actually suggest only including the changes in the function `api_request` that deal with saving the raw JSON, and just adding that to the … In other words, keep only the changes to the call of …
After some more optimisations, I am at 103 seconds for …
I have removed unnecessary changes for this approach.
I just realised that the saved JSON files can be used directly as data input in VOSviewer, so they do not need to be downloaded again.
As discussed, I created a repo that shows how one could implement a pipeline using the DuckDB R API to create a tibble from the raw JSON files.

The repo is called OpenAlex_json. It contains some functions (two modified ones from openalexR) and two more functions that either read the data in the raw JSON files as a tibble or convert it into a parquet dataset partitioned by `publication_year`. I have included some basic timing info and a comparison between the two approaches: for 17,000 records, the DuckDB route is about 20 seconds faster than `oa_fetch(output = "tibble")`. See here for the report.
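The core of such a pipeline can be sketched in a few lines of R. This is an illustrative sketch only, not code from the OpenAlex_json repo: the file paths (`data/raw_json/*.json`, `data/works_parquet`) are placeholder assumptions, and it relies on DuckDB's `read_json_auto` table function and `COPY … (FORMAT PARQUET, PARTITION_BY …)`:

```r
library(DBI)
library(duckdb)
library(tibble)

con <- dbConnect(duckdb())

# Read all raw JSON pages (as saved by a modified api_request())
# directly into a tibble; DuckDB infers the schema from the JSON.
works <- as_tibble(dbGetQuery(con, "
  SELECT *
  FROM read_json_auto('data/raw_json/*.json')
"))

# Alternatively, materialise a parquet dataset on disk,
# hive-partitioned by publication_year:
dbExecute(con, "
  COPY (SELECT * FROM read_json_auto('data/raw_json/*.json'))
  TO 'data/works_parquet' (FORMAT PARQUET, PARTITION_BY (publication_year))
")

dbDisconnect(con, shutdown = TRUE)
```

Because the schema is inferred from the JSON itself, any new fields OpenAlex adds show up in the tibble and the parquet dataset without code changes.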
The main advantage, apart from much lower memory needs and faster timing, is that it simply uses the structure returned by OpenAlex, so all upstream changes are reflected immediately without any further maintenance. Convenience functions could be added to create backward-compatible output, specific formats, etc., which can be done using SQL in DuckDB or possibly even dplyr pipelines.
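As a sketch of what such a dplyr-based convenience layer might look like (assuming dbplyr is installed and a parquet dataset at the hypothetical path `data/works_parquet`), queries can be composed lazily and pushed down into DuckDB, with only the final result collected into R:

```r
library(DBI)
library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())

# Lazy reference to the partitioned parquet dataset; nothing is
# read into R memory yet.
works <- tbl(con, sql("
  SELECT * FROM read_parquet('data/works_parquet/*/*.parquet',
                             hive_partitioning = true)
"))

# DuckDB executes the aggregation; only the small summary is collected.
per_year <- works |>
  filter(publication_year >= 2020) |>
  count(publication_year) |>
  arrange(publication_year) |>
  collect()

dbDisconnect(con, shutdown = TRUE)
```

A backward-compatible `oa_fetch()`-style output could then be just another such pipeline that renames and nests columns before `collect()`.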