feat(app): #148 Export data from scraped files #150
Conversation
adjust code to account for changes made in #150
Seems reasonable?
superseded by #158
Not superseded by #158. The difference here is that we are also defining a […]. Since these files are used as part of the process (and not just re-exported), they are technically inputs, and should be tracked as both inputs and outputs.
please adjust the PR title and description to reflect its purpose
@cjyetman I believe this is ready for review as-is. If you have specific changes you would like made to either the description or title, let me know what they should read as.
I don't understand the goal of this PR as-is, and I'm not in favor of adding a new output directory, one that was previously removed for simplification.
(Referenced code: workflow.data.preparation/run_pacta_data_preparation.R, lines 100 to 102 at 94fe89f)
By default, this does not add any new directories to the structure. It uses the existing output directory (unless overridden in the config), but exports […] to it. If you have a different default path you would like to use, I'm open to suggestions, but this one minimizes the changes we need to make for other parts of the system.
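For illustration, a minimal sketch of the fallback being described; `cfg`, `output_dir`, and the config key below are hypothetical placeholders, not the workflow's actual variables:

```r
# Placeholder values for illustration only.
output_dir <- "outputs"
cfg <- list(preflight_data_path = NULL)  # or a path string to override

# Reuse the existing outputs directory unless the config overrides the export path.
preflight_data_path <-
  if (is.null(cfg[["preflight_data_path"]])) {
    output_dir                    # default: the existing outputs path
  } else {
    cfg[["preflight_data_path"]]  # optional override from the config
  }

# Create the directory only if it does not already exist.
if (!dir.exists(preflight_data_path)) {
  dir.create(preflight_data_path, recursive = TRUE)
}
```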
I'm happy with the default path, but I'm not in favor of adding a new optional path which then requires continued support and consideration, especially without any reasonable justification for it.
The idea for that is that when running locally, or during development, or during a reconstruction of a dataset, you can specify that path to use the downloaded files (wherever they reside on the host machine). Without that, there's no good way to specify what path to use for those files.
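As a hedged example of that local/dev use case (the config key and paths below are hypothetical, not the repo's actual settings):

```r
# Local or development run: point the scraped-file path at wherever the
# downloaded files already live on the host machine.
cfg_local <- list(preflight_data_path = "~/Downloads/pacta-scraped-files")

# Default / production run: leave the override unset so the outputs path is used.
cfg_default <- list(preflight_data_path = NULL)
```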
Yeah, that's why I'm not thrilled with the idea of having timestamped output directories. My gut feeling would be to go back to saving it to the inputs directory and then also copy it over to the outputs directory, which was just changed in #127. I was preemptively avoiding a potential problem with the inputs directory being read-only with that, but I don't know if that's really a concern. If not, probably best to just revert #127.
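For reference, a minimal sketch of that save-to-inputs-then-copy-to-outputs idea in base R; the directory and file names are placeholders, and it assumes the inputs directory is writable:

```r
# Placeholder directories for illustration only.
input_dir <- "inputs"
output_dir <- "outputs"

# Hypothetical: scraped files written to the inputs directory first...
scraped_files <- file.path(input_dir, c("scraped_a.csv", "scraped_b.csv"))

# ...then also copied into the outputs directory so the dataset is self-contained.
file.copy(from = scraped_files, to = output_dir, overwrite = TRUE)
```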
I think the big difference between this and reverting #127 is that here we're only assuming that the outputs path is writable, and we can preserve that behavior, while still allowing for the split between the FactSet and AI data.
Is this a plausible concern, that the inputs directory might be read-only on Azure? I only imagined that possibility, but not sure if it's real. If so, then yeah, we're going to have to add a new directory 🤔
Not just a plausible concern, but a definite fact. I'd rather avoid having a directory there with just the scraped inputs though. My main goal with this PR is to have a way to say "from this set of inputs, do we get the same set of outputs." Not "if we lose all the outputs, can we recreate them from scratch."
ok, so I guess we have to do this then... can we make the setting of the […]? To make sure I understand the logic: […]
Does that sound right?
Mostly.
I think a lot of your concerns about interdependency are addressed in #149, where all the input and preflight filepaths are checked for existence (if needed) prior to moving forwards.
fine, I prefer to have relevant checks included where they became necessary, but these are edge cases and we'll get to it soon
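For context, a minimal sketch of the kind of up-front existence check described in the comment about #149 above; the path names are placeholders, not the workflow's actual inputs:

```r
# Placeholder paths for illustration only.
factset_data_path <- "inputs/factset_data.rds"
scraped_data_path <- "outputs/scraped_currencies.rds"

# Fail early if any required input or preflight file is missing,
# before any real work starts.
required_paths <- c(factset_data_path, scraped_data_path)
missing <- required_paths[!file.exists(required_paths)]
if (length(missing) > 0L) {
  stop("Missing required files:\n", paste("-", missing, collapse = "\n"))
}
```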
Update run_data_preparation to use the changed function in https://github.com/RMI-PACTA/pacta.data.preparation/pull/341. Also simplifies somewhat the logic around the list of input filepaths. Should be merged after that PR, but coordinated closely.

**Note:** as written, this includes #150, but reverting back to 35fd720 removes that dependency. Also as written, this depends on #153, as it does a similar simplification from `list.files()` to explicit filepath lists for the archive export feature defined in that PR. Reverting 1b8cd49 removes that connection.

- [x] Depends on #150
- [x] Depends on #153

Closes #147

Co-authored-by: CJ Yetman <[email protected]>
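To illustrate the `list.files()`-to-explicit-list simplification mentioned above, a hedged before/after sketch; the directory and filenames are placeholders:

```r
output_dir <- "outputs"  # placeholder

# Before (hypothetical): archive whatever happens to be in the directory.
archive_files <- list.files(output_dir, full.names = TRUE)

# After (hypothetical): enumerate exactly which files are expected, so the
# archive contents are explicit and reproducible.
archive_files <- file.path(
  output_dir,
  c("manifest.json", "scenarios.csv", "currencies.rds")
)
```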
Changes in this PR export data from scraped sources to file, so that all sources for a dataset are captured, and the dataset can be recreated later.
By default, this exports to the data prep outputs path, but the location can be specified with `preflight_data_path`. Writing to the outputs path by default matches current behavior, and we do not have a unified inputs path that we are currently using.

Closes: #148
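A minimal sketch of what that export might look like, assuming the scraped data arrive as in-memory objects; all object, path, and file names here are hypothetical:

```r
# Placeholder values for illustration only.
output_dir <- "outputs"
preflight_data_path <- NULL  # set to a path string to override the default
scraped_currencies <- data.frame(currency = "USD", rate = 1)
scraped_indices <- data.frame(index = "EXAMPLE", value = 100)

# Fall back to the data prep outputs path unless preflight_data_path is set.
export_dir <- if (is.null(preflight_data_path)) output_dir else preflight_data_path

# Write the scraped sources alongside the other outputs so the full set of
# inputs for the dataset is captured and the dataset can be recreated later.
saveRDS(scraped_currencies, file.path(export_dir, "scraped_currencies.rds"))
saveRDS(scraped_indices, file.path(export_dir, "scraped_indices.rds"))
```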