validation of downloaded data #126

Open
e-kotov opened this issue Jan 8, 2025 · 2 comments
Labels
bug (Something isn't working) · enhancement (New feature or request)

Comments

e-kotov (Member) commented Jan 8, 2025

The problem

curl::multi_download() works really well, but I have had a few quirks where downloaded data was corrupt. Basically, some incomplete downloads caused errors downstream, even with resume = TRUE in curl::multi_download(), because we only pass to the downloader the files that are not found on disk, without checking whether the files that are present are complete. It would be nice to prevent running the analysis and/or the conversion to DuckDB/Parquet on incomplete files.

Therefore, I think it would be nice to have a simple mechanism, independent of {curl}, that checks the files against some known characteristics.

Checking MD5/SHA256 or other checksums would take too much time. However, checking the file sizes may be feasible.

The challenge is keeping an up-to-date metadata file with those file sizes. We currently have snapshots of the metadata for v1 and v2 as url_file_size_v1.txt.gz and url_file_size_v2.txt.gz in https://github.com/rOpenSpain/spanishoddata/tree/main/inst/extdata (a sketch of how such a snapshot could be used follows the list below). But that is not very sustainable:

  1. Currently it requires a manual update (though at least there is already a function for that: https://github.com/rOpenSpain/spanishoddata/blob/main/R/dev-tools.R ).
  2. To push such an update to the CRAN version, we have to actually submit a new package version to CRAN.
  3. Data updates are no longer quarterly: new data is now published with a delay of just 3-4 days, so updates are very frequent.
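A minimal sketch of how such a snapshot could be consumed, assuming it is a whitespace-separated table with a header and columns named target_url and file_size_bytes (the actual column names in url_file_size_v*.txt.gz may differ):

```r
# Read the bundled file size snapshot (column names are assumptions here,
# not necessarily the ones actually used in url_file_size_v2.txt.gz)
size_snapshot <- read.table(
  gzfile(system.file(
    "extdata", "url_file_size_v2.txt.gz",
    package = "spanishoddata"
  )),
  header = TRUE,
  stringsAsFactors = FALSE
)

# Compare on-disk sizes of previously downloaded files with the expected sizes
sizes_match <- function(local_paths, expected_sizes) {
  actual <- file.size(local_paths)  # NA for files that do not exist
  !is.na(actual) & actual == expected_sizes
}
```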

Possible solutions for keeping file sizes up to date

  1. Set up a GitHub Action (in a new separate repo?) that runs a script which checks for the upstream XML update every day or every hour, fetches the file sizes for new files, and stores the resulting file size database as a GitHub release (another argument for a separate repo). {spanishoddata} would then fetch this database once per session; a rough sketch of such a script follows this list.
  2. Ask upstream whether the data provider can embed file sizes (and ideally also checksums) into the XML with the file links.
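A rough sketch of what the script behind option 1 could look like, using HEAD requests to read the Content-Length header. new_urls and the output file name are placeholders; the real script would first diff the upstream XML against the previous run:

```r
library(curl)

# Fetch the size of a remote file without downloading it (HEAD request)
fetch_remote_size <- function(url) {
  h <- curl::new_handle(nobody = TRUE)
  res <- curl::curl_fetch_memory(url, handle = h)
  size <- curl::parse_headers_list(res$headers)[["content-length"]]
  if (is.null(size)) NA_real_ else as.numeric(size)
}

# `new_urls` would come from the diff against the previous run
sizes <- vapply(new_urls, fetch_remote_size, numeric(1))

# Store the result; the GitHub Action would attach this file to a release
write.csv(
  data.frame(target_url = new_urls, file_size_bytes = sizes),
  "url_file_sizes.csv",
  row.names = FALSE
)
```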

Thoughts on implementation

OK, say we have a reliable way to get up-to-date file sizes. How do we use them to improve robustness?

  1. Check on-disk file sizes in spod_get() before returning the tbl_connection to the user.
  2. Check on-disk file sizes in spod_convert() before starting the conversion.
  3. Check on-disk file sizes in spod_download() and keep the incomplete files in the list of files passed to curl::multi_download() (see the sketch below).
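A minimal sketch of that last point, assuming a data frame files with columns url, local_path and expected_size (hypothetical names, not the actual internals of spod_download()):

```r
library(curl)

actual_size <- file.size(files$local_path)  # NA if the file is missing
incomplete <- is.na(actual_size) | actual_size < files$expected_size

if (any(incomplete)) {
  # Only the missing or truncated files are passed to the downloader;
  # resume = TRUE lets curl continue partial downloads instead of restarting
  curl::multi_download(
    urls = files$url[incomplete],
    destfiles = files$local_path[incomplete],
    resume = TRUE
  )
}
```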
e-kotov (Member, Author) commented Jan 8, 2025

Asked upstream whether it is possible to provide file sizes and MD5 and/or SHA256 checksums in the XML metadata file with the download links.

e-kotov added the bug (Something isn't working) and enhancement (New feature or request) labels on Jan 8, 2025
e-kotov (Member, Author) commented Jan 10, 2025

Quick update:

  • Passing a single link to a file that we know is locally incomplete works well with curl::multi_download() and the Ministry's data server. The download resumes, fetches the rest of the file, and the resulting file is complete and valid.

  • Passing all links (e.g. 1000+ links to district OD data) without filtering out existing local files also seems to work well with curl::multi_download(), as it very quickly checks the files and only resumes the ones that need it. At first glance, local file size checks do not seem necessary.

  • Yet when we know that we have all the data locally and complete, checking these links with curl::multi_download() still takes about 30 seconds (possibly more on a slow connection). That is not acceptable: spod_download() is used internally by spod_get(), and we should not make the user wait on every spod_get() call while all (or at least all of the requested) links are rechecked.

  • So we still need a built-in mechanism for checking local file size consistency.

  • Optionally, we could add another function (or an argument to existing functions) for those who want to be 100% sure about local data completeness (though ideally that is everyone, of course). It would check MD5 and/or SHA256 checksums (probably too slow; perhaps there are better options) of local files against known values that we can precompute as described above (a sketch follows this list):

  1. Set up a GitHub Action (in a new separate repo?) that checks for the upstream XML update every day or every hour, fetches the file sizes (and checksums) for new files, and stores the resulting database as a GitHub release that {spanishoddata} fetches once per session.
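A minimal sketch of such an optional strict check, assuming a lookup table known with columns local_path and md5 (hypothetical names; the reference checksums would come from the precomputed database described above):

```r
# Verify local files against precomputed MD5 checksums
verify_md5 <- function(local_paths, known_md5) {
  actual <- unname(tools::md5sum(local_paths))  # NA for missing files
  !is.na(actual) & actual == tolower(known_md5)
}

ok <- verify_md5(known$local_path, known$md5)
if (!all(ok)) {
  warning(
    "Checksum mismatch or missing file: ",
    paste(known$local_path[!ok], collapse = ", ")
  )
}
```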

For reference, draft code for these checks is a work in progress in https://github.com/rOpenSpain/spanishoddata/tree/126-validation-of-downloaded-data
