Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(app): #148 Export data from scraped files #150

Merged
merged 4 commits into from
Feb 23, 2024

Conversation

AlexAxthelm
Copy link
Collaborator

@AlexAxthelm AlexAxthelm commented Feb 18, 2024

Changes in this PR export data from scraped sources to file, so that all sources for a dataset are captured, and the dataset can be recreated later.

By default, exports to data prep outputs path, but can be specified with preflight_data_path. Writing to the outputs path by default, since this matches current behavior, and we do not have a unified inputs path that we are currently using.

Closes: #148

@AlexAxthelm AlexAxthelm requested a review from jdhoffa February 18, 2024 16:18
AlexAxthelm added a commit that referenced this pull request Feb 18, 2024
adjust code to account for changes made in #150
This was referenced Feb 18, 2024
jdhoffa
jdhoffa previously approved these changes Feb 19, 2024
Copy link
Member

@jdhoffa jdhoffa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable?

@cjyetman
Copy link
Member

superseded by #158

@cjyetman cjyetman closed this Feb 22, 2024
cjyetman added a commit that referenced this pull request Feb 22, 2024
@AlexAxthelm
Copy link
Collaborator Author

Not superseded by #158.

The difference her is that we are also defining a preflight_data_path (which currently happens to default to output_data_path, but coming back to it, I'm thinking we should be using the inputs path instead) and writing the files there (whereas in 158, it's always outputs path).

Since these files are used as part of the process (and not just reexported), they are technically inputs, and should be tracked as both inputs and outputs.

@AlexAxthelm AlexAxthelm reopened this Feb 22, 2024
@cjyetman
Copy link
Member

cjyetman commented Feb 22, 2024

please adjust the PR title and description to reflect its purpose

@AlexAxthelm
Copy link
Collaborator Author

@cjyetman I believe this is ready for review as-is. If you have specific changes you would like to either the description or title, let me know what they should read as.

@cjyetman
Copy link
Member

I don't understand the goal of this PR as-is, and I'm not in favor of adding a new output directory, one that was previously removed for simplification.

@AlexAxthelm
Copy link
Collaborator Author

if (preflight_data_path == "") {
preflight_data_path <- data_prep_outputs_path
}

By default, this does not add any new directories to the structure. It uses the existing output directory (unless overriden in the config), but exports currencies.rdsand the index_regions.rds to enhance reproducibility.

If you have a different default path you would like to use, I'm open to suggestions, but this one minimizes the changes we need to make for other parts of the system.

@cjyetman
Copy link
Member

I happy with the default path, I'm not in favor of adding a new optional path which then requires continued support and consideration, especially without any reasonable justification for it.

@AlexAxthelm
Copy link
Collaborator Author

The idea for that is that when running locally, or during development, or during a reconstruction of a dataset, you can specify that path to use the downloaded files (wherever they reside on the host machine). Without that, there's no good way to specify what path to use for those files.

@cjyetman
Copy link
Member

Yeah, that's why I'm not thrilled with the idea of having timestamped output directories. My gut feeling would be to go back to saving it to the inputs directory and then also copy it over to the outputs directory, which was just changed in #127. I was preemptively avoiding a potential problem with the inputs directory being read-only with that, but I don't know if that's really a concern. If not, probably best to just revert #127.

@AlexAxthelm
Copy link
Collaborator Author

I think the big difference between this and reverting #127 is that here we're only assuming that the outputs path is writable, and we can preserve that behavior, while still allowing for the split between the FactSet and AI data.

@cjyetman
Copy link
Member

I think the big difference between this and reverting #127 is that here we're only assuming that the outputs path is writable, and we can preserve that behavior, while still allowing for the split between the FactSet and AI data.

Is this a plausible concern, that the inputs directory might be read-only on Azure? I only imagined that possibility, but not sure if it's real. If so, then yeah, we're going to have to add a new directory 🤔

@AlexAxthelm
Copy link
Collaborator Author

Is this a plausible concern, that the inputs directory might be read-only on Azure?

Not just a plausible concern, but a definite fact.

I'd rather avoid having a directory there with just the scraped inputs though. My main goal with this PR is to have a way to say "from this set of inputs, do we get the same set of outputs." Not "If we lose all the outputs, can we recreate them from scratch.

@cjyetman
Copy link
Member

ok, so I guess we have to do this then... can we make the setting of the preflight_path conditional on update_currencies since they're interdependent? I'm concerned about having a config with update_currencies = TRUE and preflight_path = "/some/path" and then there being some unexpected behavior

to understand the logic...

  1. if update_currencies = TRUE and preflight_path = "/some/path" - currencies are scraped and saved to specified preflight_path and then copied to specified data_prep_outputs_path
  2. if update_currencies = TRUE and preflight_path = "" - currencies are scraped and saved to specified data_prep_outputs_path only
  3. if update_currencies = FALSE and preflight_path = "/some/path" - currencies are not scraped but read in from "preflight_path/currencies.rds" and copied to specified data_prep_outputs_path
  4. if update_currencies = FALSE and preflight_path = "" - currencies are not scraped but read in from "preflight_path/currencies.rds" and saved to specified data_prep_outputs_path only

Does that sound right?

@AlexAxthelm
Copy link
Collaborator Author

ok, so I guess we have to do this then... can we make the setting of the preflight_path conditional on update_currencies since they're interdependent? I'm concerned about having a config with update_currencies = TRUE and preflight_path = "/some/path" and then there being some unexpected behavior

to understand the logic...

  1. if update_currencies = TRUE and preflight_path = "/some/path" - currencies are scraped and saved to specified preflight_path and then copied to specified data_prep_outputs_path
  2. if update_currencies = TRUE and preflight_path = "" - currencies are scraped and saved to specified data_prep_outputs_path only
  3. if update_currencies = FALSE and preflight_path = "/some/path" - currencies are not scraped but read in from "preflight_path/currencies.rds" and copied to specified data_prep_outputs_path
  4. if update_currencies = FALSE and preflight_path = "" - currencies are not scraped but read in from "preflight_path/currencies.rds" and saved to specified data_prep_outputs_path only

Does that sound right?

Mostly.

  1. if update_currencies = FALSE and preflight_path = "" - currencies are not scraped but read in from "data_prep_outputs_path/currencies.rds" **which will (probably) cause an error on read, since the file (probably) doesn't exist.

I think a lot of your concerns about interdependency are addressed in #149, where all the input and preflight filepaths are checked for existence (if needed) prior to moving forwards.

@cjyetman
Copy link
Member

fine, I prefer to have relevant checks included where they became necessary, but these are edge cases and we'll get to it soon

@AlexAxthelm AlexAxthelm merged commit 5b2617e into main Feb 23, 2024
@AlexAxthelm AlexAxthelm deleted the feat/148-export-scraped-inputs branch February 23, 2024 12:58
AlexAxthelm added a commit that referenced this pull request Feb 23, 2024
Update run_data_preparation to use changed function in
https://github.com/RMI-PACTA/pacta.data.preparation/pull/341.

Also simplifies somewhat the logic around the list of input filepaths.

Should be merged after that, but coordinated closely.

**Note:** as-written, this includes #150, but reverting back to
[35fd720](35fd720)
removes that dependency.

Also as written, depends on #153, as this does a similar simplification
from `list.files()` to explicit filepath lists for the archive export
feature defined in that PR. Reverting
[1b8cd49](1b8cd49)
removes that connection.

- [x] Depends on #150 
- [x] Depends on #153 

Closes #147

---------

Co-authored-by: CJ Yetman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Export copies of scraped data
3 participants