validation of downloaded data #126

Open
e-kotov opened this issue Jan 8, 2025 · 2 comments
Labels
bug (Something isn't working) · enhancement (New feature or request)

Comments

e-kotov (Member) commented Jan 8, 2025

The problem

curl::multi_download() works really well, but I have had a few quirks where downloaded data was corrupt. Basically, some incomplete downloads caused errors downstream, even with resume = TRUE in curl::multi_download(), because we only pass to the downloader the files that are not found on disk, without checking whether the files that are present are complete. It would be nice to prevent running the analysis and/or the conversion to DuckDB/Parquet on incomplete files.

Therefore, I think it would be nice to have a simple mechanism, independent of {curl}, that checks the files against some known characteristics.

Checking MD5/SHA256 or other checksums would take too much time. However, checking the file sizes may be feasible.

The challenge is keeping an up-to-date metadata file with those file sizes. We currently have snapshots of the metadata for v1 and v2 as url_file_size_v1.txt.gz and url_file_size_v2.txt.gz in https://github.com/rOpenSpain/spanishoddata/tree/main/inst/extdata (a sketch of how such a snapshot could be used follows the list below). But that is not very sustainable:

  1. Currently it requires a manual update (though at least there is already a function for that: https://github.com/rOpenSpain/spanishoddata/blob/main/R/dev-tools.R ).
  2. To push such an update to the CRAN version, we have to actually submit a new package version to CRAN.
  3. Data updates are no longer quarterly: new data is now published with a delay of just 3-4 days, so updates are very frequent.
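A minimal sketch of how such a snapshot could be consumed, assuming it is a whitespace-separated table with a header and columns named target_url and file_size_bytes (the actual column names in url_file_size_v*.txt.gz may differ):

```r
# Read the bundled file size snapshot (column names are assumptions here,
# not necessarily the ones actually used in url_file_size_v2.txt.gz)
size_snapshot <- read.table(
  gzfile(system.file(
    "extdata", "url_file_size_v2.txt.gz",
    package = "spanishoddata"
  )),
  header = TRUE,
  stringsAsFactors = FALSE
)

# Compare on-disk sizes of previously downloaded files with the expected sizes
sizes_match <- function(local_paths, expected_sizes) {
  actual <- file.size(local_paths)  # NA for files that do not exist
  !is.na(actual) & actual == expected_sizes
}
```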

Possible solutions for keeping file sizes up to date

  1. Set up a GitHub Action (in a new separate repo?) that runs a script which checks for the upstream XML update every day or every hour, fetches the file sizes for new files, and stores the resulting file size database as a GitHub release (another argument for a separate repo). {spanishoddata} would then fetch this database once per session; a rough sketch of such a script follows this list.
  2. Ask upstream whether the data provider can embed file sizes (and ideally also checksums) into the XML with the file links.
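A rough sketch of what the script behind option 1 could look like, using HEAD requests to read the Content-Length header. new_urls and the output file name are placeholders; the real script would first diff the upstream XML against the previous run:

```r
library(curl)

# Fetch the size of a remote file without downloading it (HEAD request)
fetch_remote_size <- function(url) {
  h <- curl::new_handle(nobody = TRUE)
  res <- curl::curl_fetch_memory(url, handle = h)
  size <- curl::parse_headers_list(res$headers)[["content-length"]]
  if (is.null(size)) NA_real_ else as.numeric(size)
}

# `new_urls` would come from the diff against the previous run
sizes <- vapply(new_urls, fetch_remote_size, numeric(1))

# Store the result; the GitHub Action would attach this file to a release
write.csv(
  data.frame(target_url = new_urls, file_size_bytes = sizes),
  "url_file_sizes.csv",
  row.names = FALSE
)
```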

Thoughts on implementation

OK, say we have a reliable way to get up-to-date file sizes. How do we use them to improve robustness?

  1. Check on-disk file sizes in spod_get() before returning the tbl_connection to the user.
  2. Check on-disk file sizes in spod_convert() before starting the conversion.
  3. Check on-disk file sizes in spod_download() and keep the incomplete files in the list of files passed to curl::multi_download() (see the sketch below).
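A minimal sketch of that last point, assuming a data frame files with columns url, local_path and expected_size (hypothetical names, not the actual internals of spod_download()):

```r
library(curl)

actual_size <- file.size(files$local_path)  # NA if the file is missing
incomplete <- is.na(actual_size) | actual_size < files$expected_size

if (any(incomplete)) {
  # Only the missing or truncated files are passed to the downloader;
  # resume = TRUE lets curl continue partial downloads instead of restarting
  curl::multi_download(
    urls = files$url[incomplete],
    destfiles = files$local_path[incomplete],
    resume = TRUE
  )
}
```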
e-kotov (Member, Author) commented Jan 8, 2025

Asked upstream whether it is possible to provide file sizes and MD5 and/or SHA256 checksums in the XML metadata file with the download links.

e-kotov added the bug (Something isn't working) and enhancement (New feature or request) labels on Jan 8, 2025
e-kotov (Member, Author) commented Jan 10, 2025

Quick update:

  • Passing a single link to a file that we know is locally incomplete works well with curl::multi_download() and the Ministry's data server. The download resumes, fetches the rest of the file, and the resulting file is complete and valid.

  • Passing all links (e.g. 1000+ links to district OD data) without filtering out existing local files also seems to work well with curl::multi_download(), as it very quickly checks the files and only resumes the ones that need it. At first glance, local file size checks do not seem necessary.

  • Yet when we know that we have all the data locally and complete, checking these links with curl::multi_download() still takes about 30 seconds (possibly more on a slow connection). That is not acceptable: spod_download() is used internally by spod_get(), and we should not make the user wait on every spod_get() call while all (or at least all of the requested) links are rechecked.

  • So we still need a built-in mechanism for checking local file size consistency.

  • Optionally, we could add another function (or an argument to existing functions) for those who want to be 100% sure about local data completeness (though ideally that is everyone, of course). It would check MD5 and/or SHA256 checksums (probably too slow; perhaps there are better options) of local files against known values that we can precompute as described above (a sketch follows this list):

  1. Set up a GitHub Action (in a new separate repo?) that checks for the upstream XML update every day or every hour, fetches the file sizes (and checksums) for new files, and stores the resulting database as a GitHub release that {spanishoddata} fetches once per session.
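A minimal sketch of such an optional strict check, assuming a lookup table known with columns local_path and md5 (hypothetical names; the reference checksums would come from the precomputed database described above):

```r
# Verify local files against precomputed MD5 checksums
verify_md5 <- function(local_paths, known_md5) {
  actual <- unname(tools::md5sum(local_paths))  # NA for missing files
  !is.na(actual) & actual == tolower(known_md5)
}

ok <- verify_md5(known$local_path, known$md5)
if (!all(ok)) {
  warning(
    "Checksum mismatch or missing file: ",
    paste(known$local_path[!ok], collapse = ", ")
  )
}
```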

For reference, draft code for these checks is a work in progress in https://github.com/rOpenSpain/spanishoddata/tree/126-validation-of-downloaded-data
