Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement pacta.data.validation #185

Merged
merged 18 commits into from
Apr 18, 2024
Merged

implement pacta.data.validation #185

merged 18 commits into from
Apr 18, 2024

Conversation

cjyetman
Copy link
Member

@cjyetman cjyetman commented Mar 7, 2024

closes #18

Most validation errors originally found (below) have been resolved. Validation of financial_data and abcd_flags_equity has been removed for now and will be added in the future.

investigation issues:

relevant fixes in pacta.data.validation:

relevant fix in pacta.data.preparation

validation of financial_data and abcd_flags_equity has been removed from this PR, and future intended implementation is tracked here


currently, a few of the validation checks fail using 2023Q4 config

root_dir <- "~/data/workflow-data-preparation-outputs/2023Q4_20240303T082642Z"

pacta.data.validation::validate_currencies(
  readRDS(file.path(root_dir, "currencies.rds"))
)

pacta.data.validation::validate_financial_data(
  readRDS(file.path(root_dir, "financial_data.rds"))
)
#> Error in base::tryCatch(base::withCallingHandlers({: 1 assertions failed:
#>  * Variable 'data$isin': must contain only valid ISINs, but has
#>  * additional elements "US78467KA#42", "US85255AA*13", "NOR6778@AB29",
#>  * "US048303E#47", "US313855F*48", "US372460A#28", "CA820439A#42",
#>  * "CA008474E@39", "US941848D#79", "US524908@431", "US00253#AB60",
#>  * "US615369A@49", "US893578A@31", "US57555*AD16", "BEB7227*AB60",
#>  * "US031100A@92", "US628464A#60", "US10468*AE44", …, "US758750B*36",
#>  * and "US15135#BQ49".

pacta.data.validation::validate_masterdata_debt_datastore(
  readRDS(file.path(root_dir, "masterdata_debt_datastore.rds"))
)
#> Error in base::tryCatch(base::withCallingHandlers({: 4 assertions failed:
#>  * Variable 'data$technology': must contain only valid technology
#>  * names, but has additional element "ICE Hydrogen_HDV".
#>  * Variable 'data$ald_production_unit': must contain only valid
#>  * production units, but has additional elements "# vehicles", "dwt
#>  * km", "t cement", "t coal", and "t steel".
#>  * Variable 'data$ald_emissions_factor_unit': must contain only valid
#>  * emissions factor units, but has additional elements "tCO2/dwt km",
#>  * "tCO2/km", "tCO2/pkm", "tCO2/tkm", "tCO2e/GJ", "tCO2e/MWh", "tCO2e/t
#>  * cement", "tCO2e/t coal", and "tCO2e/t steel".
#>  * Variable 'data$technology': must contain only valid technology names
#>  * for HDV, but has additional elements %s.

pacta.data.validation::validate_masterdata_ownership_datastore(
  readRDS(file.path(root_dir, "masterdata_ownership_datastore.rds"))
)
#> Error in base::tryCatch(base::withCallingHandlers({: 4 assertions failed:
#>  * Variable 'data$technology': must contain only valid technology
#>  * names, but has additional element "ICE Hydrogen_HDV".
#>  * Variable 'data$ald_production_unit': must contain only valid
#>  * production units, but has additional elements "# vehicles", "dwt
#>  * km", "t cement", "t coal", and "t steel".
#>  * Variable 'data$ald_emissions_factor_unit': must contain only valid
#>  * emissions factor units, but has additional elements "tCO2/dwt km",
#>  * "tCO2/km", "tCO2/pkm", "tCO2/tkm", "tCO2e/GJ", "tCO2e/MWh", "tCO2e/t
#>  * cement", "tCO2e/t coal", and "tCO2e/t steel".
#>  * Variable 'data$technology': must contain only valid technology names
#>  * for HDV, but has additional elements %s.

pacta.data.validation::validate_abcd_flags_bonds(
  readRDS(file.path(root_dir, "abcd_flags_bonds.rds"))
)
#> Error in assert_valid_factset_entity_id(data[[col_name]], add = coll, : Assertion on 'data$credit_parent_id' failed: must contain only valid FactSet entity IDs, but has additional element NA.

pacta.data.validation::validate_abcd_flags_equity(
  readRDS(file.path(root_dir, "abcd_flags_equity.rds"))
)
#> Error in base::tryCatch(base::withCallingHandlers({: 1 assertions failed:
#>  * Variable 'data$isin': must contain only valid ISINs, but has
#>  * additional elements "US78467KA#42", "US85255AA*13", "NOR6778@AB29",
#>  * "US048303E#47", "US313855F*48", "US372460A#28", "CA820439A#42",
#>  * "CA008474E@39", "US941848D#79", "US524908@431", "US00253#AB60",
#>  * "US615369A@49", "US893578A@31", "US57555*AD16", "BEB7227*AB60",
#>  * "US031100A@92", "US628464A#60", "US10468*AE44", …, "US758750B*36",
#>  * and "US15135#BQ49".

@jdhoffa
Copy link
Member

jdhoffa commented Mar 7, 2024

💯

@jdhoffa
Copy link
Member

jdhoffa commented Mar 7, 2024

#> Error in base::tryCatch(base::withCallingHandlers({: 1 assertions failed:
#>  * Variable 'data$isin': must contain only valid ISINs, but has
#>  * additional elements "US78467KA#42", "US85255AA*13", "NOR6778@AB29",
#>  * "US048303E#47", "US313855F*48", "US372460A#28", "CA820439A#42",
#>  * "CA008474E@39", "US941848D#79", "US524908@431", "US00253#AB60",
#>  * "US615369A@49", "US893578A@31", "US57555*AD16", "BEB7227*AB60",
#>  * "US031100A@92", "US628464A#60", "US10468*AE44", …, "US758750B*36",
#>  * and "US15135#BQ49".

AFAIK, those are valid ISINs, they are just ISINs of private securities (venture capital, and things like that). Had a look at this when I was looking into Private equity, it's sorta cool that we have them actually! I would loosen the restrictions on the expectation there personally (but maybe ask Nick for a second opinion?)

But in any case, TM doesn't publicize that it can cover private equity, so I think it's totally fine to just filter them out.

@jdhoffa
Copy link
Member

jdhoffa commented Mar 7, 2024

#> Error in base::tryCatch(base::withCallingHandlers({: 4 assertions failed:
#>  * Variable 'data$technology': must contain only valid technology
#>  * names, but has additional element "ICE Hydrogen_HDV".
#>  * Variable 'data$ald_production_unit': must contain only valid
#>  * production units, but has additional elements "# vehicles", "dwt
#>  * km", "t cement", "t coal", and "t steel".
#>  * Variable 'data$ald_emissions_factor_unit': must contain only valid
#>  * emissions factor units, but has additional elements "tCO2/dwt km",
#>  * "tCO2/km", "tCO2/pkm", "tCO2/tkm", "tCO2e/GJ", "tCO2e/MWh", "tCO2e/t
#>  * cement", "tCO2e/t coal", and "tCO2e/t steel".
#>  * Variable 'data$technology': must contain only valid technology names
#>  * for HDV, but has additional elements %s.

This part is pretty important. Could be coming from two places, either:

  • AI decided to update the values it uses for emission factors and what not (e.g. tCO2/ km -> tCO2/km)
  • AI is providing certain new data points in the output that are not even expected (e.g. tCO2/dwt km)
  • (or both?)

I guess we have a decision to make regarding:

  • Updating pacta.scenario.preparation and pacta.data.validation to expect what is here
  • Updating masterdata_debt to match units expected by pacta.scenario.preparation and pacta.data.validation

Of the two, I guess I would probably prefer the former, which will require a PR to:
https://github.com/RMI-PACTA/pacta.scenario.preparation
But happy to do either, thoughts @cjyetman @AlexAxthelm ?

Hopefully, we just need to adjust the strings themselves, and don't need to change the actual data/ assumptions at all.

@jdhoffa
Copy link
Member

jdhoffa commented Mar 7, 2024

This seems to just be an issue with HDV, which I don't really use at all (it's still not supported in r2dii.*) so I don't have any useful feedback there.

What technology name were you using for HDV here?

#>  * Variable 'data$technology': must contain only valid technology names
#>  * for HDV, but has additional elements %s.

@cjyetman cjyetman changed the title implement pacta.data.validation implement pacta.data.validation Mar 14, 2024
@cjyetman cjyetman marked this pull request as ready for review April 5, 2024 09:25
@cjyetman cjyetman marked this pull request as draft April 5, 2024 09:26
@cjyetman cjyetman merged commit a837c76 into main Apr 18, 2024
1 check passed
@cjyetman cjyetman deleted the use-pacta.data.validation branch April 18, 2024 09:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use pacta.data.validation functions to validate pacta data
2 participants