Using duckDB via json to create tibble from OpenAlex return values #275

Open
rkrug opened this issue Sep 11, 2024 · 4 comments

rkrug commented Sep 11, 2024

As discussed, I created a repo which shows how one could implement a pipeline which uses the DuckDB R-API to create a tibble from the raw json files.

The repo is called OpenAlex_json. It contains some functions (two modified ones from openalexR) and two more functions that either read the data in the raw json files as a tibble or convert it into a parquet dataset partitioned by publication_year. I have included some basic timing info and a comparison between the two approaches: for 17,000 records, the one via DuckDB is about 20 seconds faster than oa_fetch(output = "tibble").

See here for the report.
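
For readers who want a feel for the approach, here is a minimal sketch of reading raw json pages into a tibble with DuckDB and writing them out as a parquet dataset partitioned by publication_year. It is not the code from the repo. The file layout (one json file per page, with the works under a top-level `results` key), the made-up records, and the names `json_dir`, `parquet_dir` and `works` are all assumptions for illustration.

```r
library(DBI)
library(duckdb)

# Assumption: each saved page is one json file with the works under "results".
# The records below are made up so that the example is self-contained.
json_dir <- file.path(tempdir(), "oa_json_example")
dir.create(json_dir, showWarnings = FALSE)
writeLines(
  '{"results": [{"id": "W1", "publication_year": 2020, "title": "A"}, {"id": "W2", "publication_year": 2021, "title": "B"}]}',
  file.path(json_dir, "page_1.json")
)
writeLines(
  '{"results": [{"id": "W3", "publication_year": 2021, "title": "C"}]}',
  file.path(json_dir, "page_2.json")
)
json_glob <- file.path(normalizePath(json_dir, winslash = "/"), "*.json")

con <- dbConnect(duckdb())

# Inner unnest(): one row per work; outer unnest(): one column per field.
dbExecute(con, sprintf(
  "CREATE TABLE works AS
   SELECT unnest(r)
   FROM (SELECT unnest(results) AS r FROM read_json_auto('%s'))",
  json_glob
))

# Option 1: pull the result into R as a tibble.
works <- tibble::as_tibble(dbGetQuery(con, "SELECT * FROM works ORDER BY id"))

# Option 2: write a parquet dataset, one sub-directory per publication_year.
parquet_dir <- normalizePath(tempfile("oa_parquet_"), winslash = "/", mustWork = FALSE)
dbExecute(con, sprintf(
  "COPY (SELECT * FROM works) TO '%s' (FORMAT PARQUET, PARTITION_BY (publication_year))",
  parquet_dir
))

dbDisconnect(con, shutdown = TRUE)
```

Because the schema is inferred by `read_json_auto()`, fields that OpenAlex adds later should show up as new columns without any change to the code, at least in this simplified setting. Real OpenAlex records are deeply nested, so the resulting columns would include list and struct types that need further handling.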

The main advantage, apart from much lower memory needs and faster run times, is that it simply uses the structure returned from OpenAlex, so all changes on the OpenAlex side are reflected immediately without any further maintenance. Convenience functions could be added to create backward-compatible output, specific formats, etc., which can be done using SQL in DuckDB or possibly even dplyr pipelines.
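
As a rough illustration of such a convenience function, the sketch below uses plain SQL in DuckDB to expose a table under different column names. The function `oa_works_compat()`, the table `works_raw`, the renamed column `display_name` and the sample rows are all hypothetical; nothing here is taken from openalexR or from the repo.

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# Made-up rows standing in for works that were read from the raw json files.
works_raw <- data.frame(
  id = c("W1", "W2", "W3"),
  title = c("A", "B", "C"),
  publication_year = c(2020L, 2021L, 2021L)
)
dbWriteTable(con, "works_raw", works_raw)

# Hypothetical convenience function: select fields and rename them in SQL,
# then return the result as a tibble.
oa_works_compat <- function(con) {
  tibble::as_tibble(dbGetQuery(con, "
    SELECT id, title AS display_name, publication_year
    FROM works_raw
    ORDER BY id
  "))
}

compat <- oa_works_compat(con)
dbDisconnect(con, shutdown = TRUE)
```

Keeping the reshaping in a thin SQL layer like this would leave the raw json untouched, so different output formats could be derived from the same saved files.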


rkrug commented Sep 11, 2024

I would actually suggest including only the changes in the function api_request that deal with saving the raw json, adding just that to oa_request(), and removing everything I added to oa_request() itself for saving the results, as this is no longer needed.

In other words, keep only the changes to the call of api_request()


rkrug commented Sep 11, 2024

After some more optimisations, I am at 103 seconds for oa_fetch() versus 59 seconds via json and DuckDB to get a tibble.


rkrug commented Sep 12, 2024

I have removed unnecessary changes for this approach.


rkrug commented Sep 16, 2024

I just realised that the saved json files can be used directly as data input in VOSviewer, so they do not need to be downloaded again.

rkrug changed the title from "Using duckDB via json to create kibble from OpenAlex return values" to "Using duckDB via json to create tibble from OpenAlex return values" on Sep 24, 2024