Save raw json to enable download of large number of records #276

Closed
wants to merge 3 commits into from

@rkrug rkrug commented Sep 16, 2024

This pull request adds an argument (json_dir) to the functions api_request() and oa_request(). If it is set, the raw JSON returned by each call to OpenAlex is saved to that directory, one file per page, and no further processing is done.

This makes it possible to download a large number of records, which is not feasible with the current approach.
In addition, it enables power users to process the JSON according to their needs, while staying invisible to the casual user, as this argument is not available in oa_fetch().
This also enables an efficient conversion into a tibble, as demonstrated at https://github.com/rkrug/openalexr_json/blob/b67560708abf711fdc0e5c9b26c2327e12b7cc1b/R/json_to_tibble.R. At https://rkrug.github.io/openalexr_json/README.html you can see a timing comparison.

An additional bonus is that the JSON files can be read directly into VOSviewer, which makes it possible to link openalexR directly to further analysis with VOSviewer.
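
As a rough illustration of the idea (not the code in this pull request), the per-page saving could be sketched as below. The helper save_pages(), the file-name pattern, and the example page strings are all hypothetical and use base R only; nothing here calls openalexR or the OpenAlex API.

```r
# Hypothetical helper (not part of openalexR): write each page of raw JSON
# to its own file in `json_dir` and return the file paths invisibly.
save_pages <- function(pages, json_dir) {
  dir.create(json_dir, recursive = TRUE, showWarnings = FALSE)
  # Assumed naming scheme: page_0001.json, page_0002.json, ...
  paths <- file.path(json_dir, sprintf("page_%04d.json", seq_along(pages)))
  for (i in seq_along(pages)) {
    writeLines(pages[[i]], con = paths[i])
  }
  invisible(paths)
}

# Stand-in strings for two pages of an API response (made-up content).
pages <- list(
  '{"meta":{"page":1},"results":[{"id":"W1"}]}',
  '{"meta":{"page":2},"results":[{"id":"W2"}]}'
)
out_dir <- file.path(tempdir(), "oa_json")
paths <- save_pages(pages, out_dir)
```

Because every page goes straight to disk, memory use stays bounded by a single page, and the files can be picked up later by whatever tool does the processing.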

@rkrug rkrug closed this Sep 17, 2024
@rkrug
Author

rkrug commented Sep 17, 2024

The checks now complete successfully.

@rkrug rkrug reopened this Sep 17, 2024
@rkrug rkrug changed the title from "Safe raw json to enable download of large number of records" to "Save raw json to enable download of large number of records" Sep 22, 2024
@rkrug
Author

rkrug commented Sep 22, 2024

Could I get some feedback from your side on whether you are interested in this approach of making a minimal set of changes and opening up openalexR to other packages for advanced use cases?

Please see https://github.com/rkrug/openalexPro/tree/main for what I have in mind. The package uses the downloaded raw JSON files to save them in Parquet format and to extract the countries. The aim is to move most of the heavy lifting to DuckDB, which is extremely efficient with these kinds of requests.

At the moment I am using

   utils::assignInNamespace("api_request", api_request, ns = "openalexR")
   utils::assignInNamespace("oa_request", oa_request, ns = "openalexR")

to apply the changes to openalexR so that the raw JSON files are saved, but it is a completely unsatisfying solution, as it might need adjustment after each release of openalexR.

I would very much appreciate it if you could let me know whether you are interested in committing these minor changes to openalexR. If not, I will go ahead with the approach I have at the moment.

Thanks for your consideration,

Rainer

@yjunechoe
Collaborator

Thanks for this example - I see the usefulness of giving users access to the raw JSON and the workflow you outlined in the document makes good sense to me.

But from a development and maintenance perspective, I think this implementation (of taking a json_dir argument to write out json files to disk) captures the spirit of that a bit too narrowly. For one, I would prefer not to bake in a side-effect behavior to a core function catered to the average user for everyday use. And more critically, not all power users would appreciate the round-about setup and consequence of creating a directory and writing to disk when the raw response could've been made available in R first.

So IMO this calls for simple support for output="raw", as we discussed previously. I've implemented this in a new PR #280 and will soon follow up there with an example of how that works with your duckdb workflow. Any behaviors built on top of the raw JSON strings (including writing them out) should be maintained by the power users who prefer to opt out of the core features of the package in favor of their own custom processing workflows.

@rkrug
Author

rkrug commented Sep 25, 2024

Thanks for your feedback. To avoid fragmenting the discussion, I will close this pull request and continue the discussion in https://github.com//pull/280

@rkrug rkrug closed this Sep 25, 2024
@rkrug rkrug mentioned this pull request Sep 25, 2024