Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Convert to Package infrastructure #77

Closed
AlexAxthelm opened this issue Jan 19, 2024 · 9 comments
Closed

Proposal: Convert to Package infrastructure #77

AlexAxthelm opened this issue Jan 19, 2024 · 9 comments

Comments

@AlexAxthelm
Copy link
Collaborator

I've had pretty good success with using an R package infrastructure with {workflow.factset}, since it makes it pretty easy to maintain sections of the workflow as individual functions (which is nice since dealing with garbage collection is less of an issue, no more calls to rm() to free up memory), but it also makes things easier for dependency management and if there are functions that should be tested, testing them, and documentation.

The general idea is that you have a primary function that is meant to be called:

run_pacta_data_prep <- function(cfg){
  check_all_the_files_are_there(cfg$inputs)
  export_currencies(cfg$currencies_path)
  export_index_regions(cfg$regions_path)
  export_financial_data(cfg$factset_financial_data_path, cfg$outputs_path)
  export_blah_blah_blah()
  check_everything_exported_properly()
  put_it_all_in_a_zip_file()
}

and then define whatever behaviors you need in each child file.

then getting everything ready to go is pretty straightforward for preparing the code, and just calling workflow.data.preparation::run_pacta_data_prep(config = "/path/to/config.yml") or something like that. A bit of a simplification, but overall not a terrible conversion.

cc @cjyetman @jdhoffa

@jdhoffa
Copy link
Member

jdhoffa commented Jan 19, 2024

I have had this thought a few times in the past tbh. Not 100% opposed to it, but also a LITTLE apprehensive so having a huge wrapper function that is called for it's 10000 side-effects

Good topic for tech review?

@AlexAxthelm
Copy link
Collaborator Author

AlexAxthelm commented Jan 21, 2024

@jdhoffa @cjyetman to light a fire under this, memory management is kind of a huge issue for data prep, since when I try to run on a machine with 16GB of ram, it's killing with out-of-memory errors (while connecting scnearios to ABCD). Not saying that making a package is a cureall for that, but I think it would help

(context: 16GB is the max memory for a standard instance of Azure Container Instances)

@jdhoffa
Copy link
Member

jdhoffa commented Jan 21, 2024

I don't think it would help, the memory intensive step is all already wrapped in a function I believe.

@jdhoffa
Copy link
Member

jdhoffa commented Jan 21, 2024

Unless there's something I'm missing?

@cjyetman
Copy link
Member

I don't see how "converting to package infrastructure" would solve any memory problems, but it's an interesting experiment.

@AlexAxthelm
Copy link
Collaborator Author

Overall, I think all the calls to rm() are a pretty big code smell (and that we're leaving some objects on the table that don't need to be there), and letting R's garbage collection deal with cleaning up when a function exits are preferable.

It's entirely possible that I'm wrong on this one, but I think that even if it doesn't solve the problems, we have enough repeated code (sqlite export, for example) that it's probably worth exploring

@jdhoffa
Copy link
Member

jdhoffa commented Jan 22, 2024

Yeah that's fair, having a main() function seems like an easy next step, that doesn't require us going all the way to Package infrastructure, but allows us to get rid of the rm() calls, which I tend to agree is an improvement.

@cjyetman
Copy link
Member

cjyetman commented Jan 22, 2024

You could simply wrap things in curly braces and let R do the rms for you if it makes you feel more confident about it.

Personally I think the explicitness is better.

@cjyetman
Copy link
Member

This has been partially achieved by adding a DESCRIPTION file in #83, but I doubt we will go any further with this since we will likely rather follow the suggestion in #94 to refactor much of the code in run_pacta_data_preparation.R to pacta.data.preparation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants