feat(app): #148 Export data from scraped files #150

AlexAxthelm · 2024-02-18T16:18:05Z

Changes in this PR export data from scraped sources to file, so that all sources for a dataset are captured, and the dataset can be recreated later.

By default, exports to data prep outputs path, but can be specified with preflight_data_path. Writing to the outputs path by default, since this matches current behavior, and we do not have a unified inputs path that we are currently using.

Closes: #148

adjust code to account for changes made in #150

jdhoffa

Seems reasonable?

cjyetman · 2024-02-22T10:03:49Z

superseded by #158

supersedes #150 closes #148

AlexAxthelm · 2024-02-22T11:02:35Z

Not superseded by #158.

The difference her is that we are also defining a preflight_data_path (which currently happens to default to output_data_path, but coming back to it, I'm thinking we should be using the inputs path instead) and writing the files there (whereas in 158, it's always outputs path).

Since these files are used as part of the process (and not just reexported), they are technically inputs, and should be tracked as both inputs and outputs.

cjyetman · 2024-02-22T11:03:45Z

please adjust the PR title and description to reflect its purpose

AlexAxthelm · 2024-02-23T10:28:29Z

@cjyetman I believe this is ready for review as-is. If you have specific changes you would like to either the description or title, let me know what they should read as.

cjyetman · 2024-02-23T10:31:16Z

I don't understand the goal of this PR as-is, and I'm not in favor of adding a new output directory, one that was previously removed for simplification.

AlexAxthelm · 2024-02-23T10:37:06Z

workflow.data.preparation/run_pacta_data_preparation.R

Lines 100 to 102 in 94fe89f

    
           if (preflight_data_path == "") { 
        
             preflight_data_path <- data_prep_outputs_path 
        
           }

By default, this does not add any new directories to the structure. It uses the existing output directory (unless overriden in the config), but exports currencies.rdsand the index_regions.rds to enhance reproducibility.

If you have a different default path you would like to use, I'm open to suggestions, but this one minimizes the changes we need to make for other parts of the system.

cjyetman · 2024-02-23T10:40:18Z

I happy with the default path, I'm not in favor of adding a new optional path which then requires continued support and consideration, especially without any reasonable justification for it.

AlexAxthelm · 2024-02-23T10:54:49Z

The idea for that is that when running locally, or during development, or during a reconstruction of a dataset, you can specify that path to use the downloaded files (wherever they reside on the host machine). Without that, there's no good way to specify what path to use for those files.

cjyetman · 2024-02-23T11:02:45Z

Yeah, that's why I'm not thrilled with the idea of having timestamped output directories. My gut feeling would be to go back to saving it to the inputs directory and then also copy it over to the outputs directory, which was just changed in #127. I was preemptively avoiding a potential problem with the inputs directory being read-only with that, but I don't know if that's really a concern. If not, probably best to just revert #127.

AlexAxthelm · 2024-02-23T11:39:50Z

I think the big difference between this and reverting #127 is that here we're only assuming that the outputs path is writable, and we can preserve that behavior, while still allowing for the split between the FactSet and AI data.

cjyetman · 2024-02-23T12:05:54Z

I think the big difference between this and reverting #127 is that here we're only assuming that the outputs path is writable, and we can preserve that behavior, while still allowing for the split between the FactSet and AI data.

Is this a plausible concern, that the inputs directory might be read-only on Azure? I only imagined that possibility, but not sure if it's real. If so, then yeah, we're going to have to add a new directory 🤔

AlexAxthelm · 2024-02-23T12:08:17Z

Is this a plausible concern, that the inputs directory might be read-only on Azure?

Not just a plausible concern, but a definite fact.

I'd rather avoid having a directory there with just the scraped inputs though. My main goal with this PR is to have a way to say "from this set of inputs, do we get the same set of outputs." Not "If we lose all the outputs, can we recreate them from scratch.

cjyetman · 2024-02-23T12:30:21Z

ok, so I guess we have to do this then... can we make the setting of the preflight_path conditional on update_currencies since they're interdependent? I'm concerned about having a config with update_currencies = TRUE and preflight_path = "/some/path" and then there being some unexpected behavior

to understand the logic...

if update_currencies = TRUE and preflight_path = "/some/path" - currencies are scraped and saved to specified preflight_path and then copied to specified data_prep_outputs_path
if update_currencies = TRUE and preflight_path = "" - currencies are scraped and saved to specified data_prep_outputs_path only
if update_currencies = FALSE and preflight_path = "/some/path" - currencies are not scraped but read in from "preflight_path/currencies.rds" and copied to specified data_prep_outputs_path
if update_currencies = FALSE and preflight_path = "" - currencies are not scraped but read in from "preflight_path/currencies.rds" and saved to specified data_prep_outputs_path only

Does that sound right?

AlexAxthelm · 2024-02-23T12:44:54Z

ok, so I guess we have to do this then... can we make the setting of the preflight_path conditional on update_currencies since they're interdependent? I'm concerned about having a config with update_currencies = TRUE and preflight_path = "/some/path" and then there being some unexpected behavior

to understand the logic...

if update_currencies = TRUE and preflight_path = "/some/path" - currencies are scraped and saved to specified preflight_path and then copied to specified data_prep_outputs_path

if update_currencies = TRUE and preflight_path = "" - currencies are scraped and saved to specified data_prep_outputs_path only

if update_currencies = FALSE and preflight_path = "/some/path" - currencies are not scraped but read in from "preflight_path/currencies.rds" and copied to specified data_prep_outputs_path

if update_currencies = FALSE and preflight_path = "" - currencies are not scraped but read in from "preflight_path/currencies.rds" and saved to specified data_prep_outputs_path only

Does that sound right?

Mostly.

if update_currencies = FALSE and preflight_path = "" - currencies are not scraped but read in from "data_prep_outputs_path/currencies.rds" **which will (probably) cause an error on read, since the file (probably) doesn't exist.

I think a lot of your concerns about interdependency are addressed in #149, where all the input and preflight filepaths are checked for existence (if needed) prior to moving forwards.

cjyetman · 2024-02-23T12:56:33Z

fine, I prefer to have relevant checks included where they became necessary, but these are edge cases and we'll get to it soon

Update run_data_preparation to use changed function in https://github.com/RMI-PACTA/pacta.data.preparation/pull/341. Also simplifies somewhat the logic around the list of input filepaths. Should be merged after that, but coordinated closely. **Note:** as-written, this includes #150, but reverting back to [35fd720](35fd720) removes that dependency. Also as written, depends on #153, as this does a similar simplification from `list.files()` to explicit filepath lists for the archive export feature defined in that PR. Reverting [1b8cd49](1b8cd49) removes that connection. - [x] Depends on #150 - [x] Depends on #153 Closes #147 --------- Co-authored-by: CJ Yetman <[email protected]>

feat(app): #148 Export data from scraped files

2d2c72f

Closes: #148

AlexAxthelm requested a review from jdhoffa February 18, 2024 16:18

AlexAxthelm requested a review from cjyetman as a code owner February 18, 2024 16:18

AlexAxthelm added a commit that referenced this pull request Feb 18, 2024

feat(app): #147 INclude preflight paths in input_filepaths

15058c5

adjust code to account for changes made in #150

This was referenced Feb 18, 2024

Feat/147 update calls to write manifest #149

Merged

Develop vm #146

Closed

jdhoffa previously approved these changes Feb 19, 2024

View reviewed changes

cjyetman mentioned this pull request Feb 22, 2024

save index_regions to outputs #158

Merged

cjyetman closed this Feb 22, 2024

cjyetman added a commit that referenced this pull request Feb 22, 2024

save index_regions to outputs (#158)

e27b2d4

supersedes #150 closes #148

AlexAxthelm reopened this Feb 22, 2024

Merge branch 'main' into feat/148-export-scraped-inputs

f74c51f

cjyetman added the config label Feb 23, 2024

Merge branch 'main' into feat/148-export-scraped-inputs

94fe89f

cjyetman dismissed jdhoffa’s stale review via 94fe89f February 23, 2024 08:16

Merge branch 'main' into feat/148-export-scraped-inputs

60dd009

cjyetman approved these changes Feb 23, 2024

View reviewed changes

AlexAxthelm merged commit 5b2617e into main Feb 23, 2024

AlexAxthelm deleted the feat/148-export-scraped-inputs branch February 23, 2024 12:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(app): #148 Export data from scraped files #150

feat(app): #148 Export data from scraped files #150

AlexAxthelm commented Feb 18, 2024 •

edited

Loading

jdhoffa left a comment

cjyetman commented Feb 22, 2024

AlexAxthelm commented Feb 22, 2024

cjyetman commented Feb 22, 2024 •

edited

Loading

AlexAxthelm commented Feb 23, 2024

cjyetman commented Feb 23, 2024

AlexAxthelm commented Feb 23, 2024

cjyetman commented Feb 23, 2024

AlexAxthelm commented Feb 23, 2024

cjyetman commented Feb 23, 2024

AlexAxthelm commented Feb 23, 2024

cjyetman commented Feb 23, 2024

AlexAxthelm commented Feb 23, 2024

cjyetman commented Feb 23, 2024

AlexAxthelm commented Feb 23, 2024

cjyetman commented Feb 23, 2024

feat(app): #148 Export data from scraped files #150

feat(app): #148 Export data from scraped files #150

Conversation

AlexAxthelm commented Feb 18, 2024 • edited Loading

jdhoffa left a comment

Choose a reason for hiding this comment

cjyetman commented Feb 22, 2024

AlexAxthelm commented Feb 22, 2024

cjyetman commented Feb 22, 2024 • edited Loading

AlexAxthelm commented Feb 23, 2024

cjyetman commented Feb 23, 2024

AlexAxthelm commented Feb 23, 2024

cjyetman commented Feb 23, 2024

AlexAxthelm commented Feb 23, 2024

cjyetman commented Feb 23, 2024

AlexAxthelm commented Feb 23, 2024

cjyetman commented Feb 23, 2024

AlexAxthelm commented Feb 23, 2024

cjyetman commented Feb 23, 2024

AlexAxthelm commented Feb 23, 2024

cjyetman commented Feb 23, 2024

AlexAxthelm commented Feb 18, 2024 •

edited

Loading

cjyetman commented Feb 22, 2024 •

edited

Loading