Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use explicit filenames for FactSet files #99

Merged
merged 14 commits into from
Feb 12, 2024

Conversation

cjyetman
Copy link
Member

@cjyetman cjyetman commented Feb 9, 2024

closes #82
based on a seemingly rough consensus here: #75 (comment)

@cjyetman
Copy link
Member Author

cjyetman commented Feb 9, 2024

This is a bit preemptive because we don't yet know what the timestamped FactSet files should really be for each quarter, but the goal of this PR is to structurally change the config file so that the expectation is that FactSet filenames are explicitly named in each config set. Doing so will facilitate testing and continued development while we wait for the the approved FactSet files to be selected, so given that, I think it's alright to merge this now and I'll make an issue to update the explicit FactSet filenames once they are known (#108).

@cjyetman cjyetman marked this pull request as ready for review February 9, 2024 11:35
@jdhoffa
Copy link
Member

jdhoffa commented Feb 9, 2024

I don't really have anything relevant to review here.
Seems reasonable but depends primarily on @AlexAxthelm work in workflow.factset. Will leave the review to him.

Copy link
Collaborator

@AlexAxthelm AlexAxthelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that the checks for existence will fail if the factset_data_path exists, but the files that are supposed to be there don't.

stopifnot(file.exists(factset_financial_data_path))
stopifnot(file.exists(factset_entity_info_path))
stopifnot(file.exists(factset_entity_financing_data_path))
stopifnot(file.exists(factset_fund_data_path))
stopifnot(file.exists(factset_isin_to_fund_table_path))
stopifnot(file.exists(factset_iss_emissions_data_path))

file.exists() returns TRUE for directories that exist, so if it's checking on file.exists(file.path(some_dir, "")), it's only checking if some_dir exists. So if the wrong config is passed in, we'll eventually get errors about those files not existing, but it could bypass these checks (which are meant to cause a fast fail).

Suggest either stengthening the checks, or setting the defaults to NULL rather than "", which will cause file.path() to return an empty string.

@cjyetman
Copy link
Member Author

cjyetman commented Feb 9, 2024

My concern is that the checks for existence will fail if the factset_data_path exists, but the files that are supposed to be there don't.

stopifnot(file.exists(factset_financial_data_path))
stopifnot(file.exists(factset_entity_info_path))
stopifnot(file.exists(factset_entity_financing_data_path))
stopifnot(file.exists(factset_fund_data_path))
stopifnot(file.exists(factset_isin_to_fund_table_path))
stopifnot(file.exists(factset_iss_emissions_data_path))

file.exists() returns TRUE for directories that exist, so if it's checking on file.exists(file.path(some_dir, "")), it's only checking if some_dir exists. So if the wrong config is passed in, we'll eventually get errors about those files not existing, but it could bypass these checks (which are meant to cause a fast fail).

Suggest either stengthening the checks, or setting the defaults to NULL rather than "", which will cause file.path() to return an empty string.

Good point! Even NULL isn't safe...

file.exists(file.path("~/Desktop/"))
#> [1] TRUE

file.exists(file.path("~/Desktop/", ""))
#> [1] TRUE
stopifnot(file.exists(file.path("~/Desktop/", "")))

file.exists(file.path("~/Desktop/", NULL))
#> logical(0)
stopifnot(file.exists(file.path("~/Desktop/", NULL)))

test <- NULL
file.exists(file.path("~/Desktop/", test))
#> logical(0)
stopifnot(file.exists(file.path("~/Desktop/", test)))

@cjyetman
Copy link
Member Author

cjyetman commented Feb 9, 2024

I guess we'll need a new section that tests if certain parameters are not NULL or ""?

factset_financial_data_filename <- "test"
stopifnot(isFALSE(is.null(factset_financial_data_filename) || factset_financial_data_filename == ""))

factset_financial_data_filename <- ""
stopifnot(isFALSE(is.null(factset_financial_data_filename) || factset_financial_data_filename == ""))
#> Error: isFALSE(is.null(factset_financial_data_filename) || factset_financial_data_filename ==  .... is not TRUE

factset_financial_data_filename <- NULL
stopifnot(isFALSE(is.null(factset_financial_data_filename) || factset_financial_data_filename == ""))
#> Error: isFALSE(is.null(factset_financial_data_filename) || factset_financial_data_filename ==  .... is not TRUE

like

# check that parameters are not blank ------------------------------------------

stopifnot(isFALSE(is.null(factset_financial_data_filename) || factset_financial_data_filename == ""))
stopifnot(isFALSE(is.null(factset_entity_info_filename) || factset_entity_info_filename == ""))
stopifnot(isFALSE(is.null(factset_entity_financing_data_filename) || factset_entity_financing_data_filename == ""))
stopifnot(isFALSE(is.null(factset_fund_data_filename) || factset_fund_data_filename == ""))
stopifnot(isFALSE(is.null(factset_isin_to_fund_table_filename) || factset_isin_to_fund_table_filename == ""))
stopifnot(isFALSE(is.null(factset_iss_emissions_data_filename) || factset_iss_emissions_data_filename == ""))

@AlexAxthelm
Copy link
Collaborator

Good point! Even NULL isn't safe...

I think you can simplify the logic a fair bit by just putting the file.exists() in isTRUE(), which can handle the length 0 string

setwd(tempdir())

dir_name <- "foo"
dir.create(dir_name)
dir.exists(dir_name)
#> [1] TRUE
file.exists(dir_name)
#> [1] TRUE

file_one <- "file1.txt"
file.exists(file.path(dir_name, file_one))
#> [1] FALSE

file_two <- ""
file.exists(file.path(dir_name, file_two))
#> [1] TRUE
isFALSE(file.info(file.path(dir_name, file_one))$isdir)
#> [1] FALSE

file_three <- NULL
file.exists(file.path(dir_name, file_three))
#> logical(0)

## Suggested Implementation
isTRUE(file.exists(file.path(dir_name, file_three)))
#> [1] FALSE

Created on 2024-02-09 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.2 (2023-10-31)
#>  os       macOS Sonoma 14.2
#>  system   aarch64, darwin23.0.0
#>  ui       unknown
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Belgrade
#>  date     2024-02-09
#>  pandoc   3.1.7 @ /opt/homebrew/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.2)
#>  digest        0.6.34  2024-01-11 [1] CRAN (R 4.3.2)
#>  evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.2)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.2)
#>  fs            1.6.3   2023-07-20 [1] CRAN (R 4.3.2)
#>  glue          1.7.0   2024-01-09 [1] CRAN (R 4.3.2)
#>  htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.2)
#>  knitr         1.45    2023-10-30 [1] CRAN (R 4.3.2)
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.1)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.2)
#>  purrr         1.0.2   2023-08-10 [1] CRAN (R 4.3.2)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.3.1)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.3.1)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.3.1)
#>  R.utils       2.12.2  2022-11-11 [1] CRAN (R 4.3.1)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.3.1)
#>  rlang         1.1.3   2024-01-10 [1] CRAN (R 4.3.2)
#>  rmarkdown     2.25    2023-09-18 [1] CRAN (R 4.3.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.1)
#>  styler        1.10.2  2023-08-29 [1] CRAN (R 4.3.1)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.3.2)
#>  withr         3.0.0   2024-01-16 [1] CRAN (R 4.3.2)
#>  xfun          0.41    2023-11-01 [1] CRAN (R 4.3.2)
#>  yaml          2.3.8   2023-12-11 [1] CRAN (R 4.3.2)
#> 
#>  [1] /opt/homebrew/lib/R/4.3/site-library
#>  [2] /opt/homebrew/Cellar/r/4.3.2/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

@cjyetman
Copy link
Member Author

cjyetman commented Feb 9, 2024

but then it doesn't capture ""

dir_name <- "~/Desktop/"
isTRUE(file.exists(file.path(dir_name, "dataprep")))
#> [1] TRUE
isTRUE(file.exists(file.path(dir_name, "FOO")))
#> [1] FALSE
isTRUE(file.exists(file.path(dir_name, "")))
#> [1] TRUE
isTRUE(file.exists(file.path(dir_name, NULL)))
#> [1] FALSE

@cjyetman
Copy link
Member Author

cjyetman commented Feb 10, 2024

I'm going to remove the parameters check from this PR and track that issue in #118.

While they're reasonable, the need has existed for a long time and is not specific to nor created by the changes in this PR. The places where I added/edited parameters to be "", e.g factset_financial_data_filename: "", are not intended to stay that way long term, we simply do not know what those values should be at the moment. Updating those values once they have been determined is tracked here #108. This PR is to set the structure (not content) so that we can continue to test and develop around the new FactSet file situation.

@cjyetman cjyetman requested a review from AlexAxthelm February 10, 2024 05:47
@jdhoffa
Copy link
Member

jdhoffa commented Feb 10, 2024

Have you checked if this problem persists using functions from https://fs.r-lib.org/ ?

I know there's a hesitance to add dependencies which I totally understand but given that this repository is concerned primarily with file I/O, I don't think depending on a package that facilitates better file I/O in R is a terrible idea

@cjyetman
Copy link
Member Author

Have you checked if this problem persists using functions from https://fs.r-lib.org/ ?

What problem are you referring to? There's nothing directly I/O related in this PR.

@jdhoffa
Copy link
Member

jdhoffa commented Feb 10, 2024

Apologies I meant "file system"**, not file I/O

And maybe I misunderstood. I was just curious, since the base R dir.exists doesn't seem to produce the desired output, if

fs::dir_exists()

does?

It may not!

@cjyetman
Copy link
Member Author

cjyetman commented Feb 10, 2024

Apologies I meant "file system"**, not file I/O

And maybe I misunderstood. I was just curious, since the base R dir.exists doesn't seem to produce the desired output, if

fs::dir_exists()

does?

It may not!

Maybe, but I'm avoiding any checking of directories (or anything for that matter) in this PR. This PR is only to change the structure of the config.yml file to align with changes to how the FactSet input files are created and named (timestamps are encoded in the filename and therefore explicit filenames are required for each timestamps's config set).

There's a linked issue #118 to add checking for valid input, which has never been done before.

@jdhoffa
Copy link
Member

jdhoffa commented Feb 10, 2024

I was just responding to the above conversation!

Which it sounds like you would prefer happens in a new issue :-)

@jdhoffa
Copy link
Member

jdhoffa commented Feb 10, 2024

I will see myself out now haha

:-)

@cjyetman
Copy link
Member Author

I was just responding to the above conversation!

Which it sounds like you would prefer happens in a new issue :-)

yes, I prefer PRs to be precise and specific

@jdhoffa
Copy link
Member

jdhoffa commented Feb 12, 2024

I don't have the context or knowledge of the new "Factset" datasets to review this safely, so will leave to @AlexAxthelm to review

@jdhoffa jdhoffa removed their request for review February 12, 2024 09:35
@cjyetman
Copy link
Member Author

@AlexAxthelm?

Copy link
Collaborator

@AlexAxthelm AlexAxthelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with note that the files referenced here are not the final files we want to use, but this puts the infrastructure in place for when we do have a set of files we like.

@cjyetman
Copy link
Member Author

cjyetman commented Feb 12, 2024

Noted. #108

@cjyetman cjyetman merged commit d4310a1 into main Feb 12, 2024
@cjyetman cjyetman deleted the use-explicit-FactSet-filenames branch February 12, 2024 21:23
cjyetman added a commit that referenced this pull request Feb 13, 2024
depends on #99 

This PR computes the FactSet timestamp/s from the filenames of all input
FactSet data files, and adds that info to the manifest. Note that it
would be possible to use FactSet data files from different timestamps...
in which case all unique timestamps would be added to the manifest as an
array. In the future, e.g. in #118, we could consider giving an error or
warning if all of these timestamps do not match.
cjyetman added a commit that referenced this pull request Feb 13, 2024
depends on #99 

not sure why this was being done since the repo gets permanently copied
into the Docker image. @jdhoffa?

Co-authored-by: Jackson Hoffart <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

use FactSet filenames as the source of truth for the FactSet timestamp
3 participants