The problem

curl::multi_download() works really well, but I ran into a few cases where the downloaded data was corrupt. Basically, some downloads were incomplete and caused errors downstream, even with resume=TRUE in curl::multi_download(), because we only pass to it the files that are not found on disk, without checking how complete or incomplete the files that are on disk actually are. It would be nice to prevent running the analysis and/or the conversion to duckdb/parquet on incomplete files.
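For context, this is roughly the pattern in question (a simplified sketch, not the actual {spanishoddata} internals; the function and variable names are illustrative): only files missing from disk are handed to curl::multi_download(), so a partially downloaded file that already exists locally is never re-checked.

```r
library(curl)

# Simplified sketch of the current download pattern (hypothetical names):
# a file that exists on disk but is truncated passes the file.exists()
# filter and is never handed to curl for resuming.
download_missing <- function(urls, dest_dir) {
  destfiles <- file.path(dest_dir, basename(urls))
  missing <- !file.exists(destfiles)  # incomplete local files slip through here
  curl::multi_download(
    urls = urls[missing],
    destfiles = destfiles[missing],
    resume = TRUE,
    progress = TRUE
  )
}
```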
Therefore, I think it would be nice to have some simple mechanism independent of {curl} that checks the files against some known characteristics.
Checking MD5/SHA256 or other checksums would take too much time. However, checking the file sizes may be feasible.
The challenge is to keep an up-to-date metadata file with those file sizes. We currently have snapshots of metadata for v1 and v2 as url_file_size_v1.txt.gz and url_file_size_v2.txt.gz in https://github.com/rOpenSpain/spanishoddata/tree/main/inst/extdata . But that is not super sustainable:
To push an updated snapshot to the CRAN version, we have to actually submit a new package release to CRAN.
Data is no longer published once every 3 months; new data now appears with a delay of just 3-4 days, so updates are very frequent.
Possible solutions to keeping file sizes up to date
Set up (in a new, separate repo?) a GitHub Action that runs a script that checks the upstream XML for updates every day or every hour, fetches the file sizes for the new files, and stores the resulting file size database as a GitHub release (hence better in a separate repo). {spanishoddata} then fetches this once per session. A sketch of the size-fetching step follows this list.
Try asking upstream if the data provider can embed file sizes (and, even better, maybe also checksums) into the XML with the file links.
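As a rough illustration of the size-fetching step in the first option: a HEAD request per file URL retrieves the Content-Length without downloading anything. Everything here (the function name, the output format, and how the URLs are extracted from the XML) is an assumption for illustration, not existing {spanishoddata} code.

```r
library(curl)

# Hypothetical core of the scheduled script: for each file URL taken from
# the upstream XML, issue a HEAD request and record the Content-Length.
fetch_remote_sizes <- function(urls) {
  sizes <- vapply(urls, function(u) {
    h <- curl::new_handle(nobody = TRUE)  # HEAD request: headers only, no body
    res <- curl::curl_fetch_memory(u, handle = h)
    len <- curl::parse_headers_list(res$headers)[["content-length"]]
    if (is.null(len)) NA_real_ else as.numeric(len)
  }, numeric(1))
  data.frame(url = urls, size_bytes = sizes)
}

# The resulting table could then be compressed and attached to a GitHub
# release, e.g.:
# sizes <- fetch_remote_sizes(urls_from_xml)
# write.csv(sizes, gzfile("url_file_size_v2.txt.gz"), row.names = FALSE)
```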
Thoughts on implementation
Ok, say we have a reliable way to get up-to-date file sizes. How do we use them to improve robustness?
Check on-disk file sizes in spod_get() before returning the tbl_connection to the user.
Check on-disk file sizes in spod_convert() before starting the conversion.
Check on-disk file sizes in spod_download() and keep the incomplete files in the list of files passed over to curl::multi_download() (see the sketch after this list).
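A minimal sketch of such a check, assuming we already have a lookup table `expected` with columns `url` and `size_bytes` (from the bundled snapshots or the GitHub release discussed above); the function name is hypothetical.

```r
# Return the local destination files whose size does not match the expected
# size. `expected` is assumed to have columns `url` and `size_bytes`.
find_incomplete_files <- function(destfiles, urls, expected) {
  expected_sizes <- expected$size_bytes[match(urls, expected$url)]
  actual_sizes <- file.size(destfiles)  # NA for files not on disk
  # Conservative: treat missing files and unknown expected sizes as incomplete.
  incomplete <- is.na(actual_sizes) | is.na(expected_sizes) |
    actual_sizes != expected_sizes
  destfiles[incomplete]
}

# spod_download() could keep these files in the list passed to
# curl::multi_download(), while spod_get() / spod_convert() could stop
# with an informative error if the returned vector is non-empty.
```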
Passing a single link to a file that we know is incomplete locally works well with curl::multi_download() and the Ministry data server: the download resumes, fetches the rest of the file, and the file ends up complete and valid.
Passing all links (e.g. 1000+ links to district OD data) without filtering for existing local files also seems to work well with curl::multi_download(), as it very quickly checks the files and only resumes the ones that need resuming. Local file size checks do not seem necessary for correctness.
Yet, when we know that all the data is already local and complete, checking these links with curl::multi_download() takes about 30 seconds (possibly more on a slow connection), which is not acceptable: spod_download() is used internally by spod_get(), and we should not make the user wait for every spod_get() call to re-check all (or even just the requested) links every time.
So we still do need a built-in mechanism for checking local file size consistency.
Optionally, we could have another function (or an argument to existing functions) for those who want to be 100% sure about local data completeness (though ideally that's everyone, of course). It would check MD5 and/or SHA256 checksums (probably too slow; perhaps there are better options) of local files against known values that we can precompute as described above (a sketch follows the quoted idea below):
Set up (in a new, separate repo?) a GitHub Action that runs a script that checks the upstream XML for updates every day or every hour, fetches the file sizes for the new files, and stores the resulting file size database as a GitHub release (hence better in a separate repo). {spanishoddata} then fetches this once per session.
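A sketch of such a strict check, assuming we can obtain a table `known` with columns `file` and `md5`; these checksums are not published upstream today and would have to be precomputed as described in the quoted idea above. The function name is hypothetical.

```r
# Compare local file checksums against known values and return the files
# that fail the check. Uses base R's tools::md5sum(), so no extra dependency.
verify_checksums <- function(files, known) {
  local_md5 <- tools::md5sum(files)  # NA for unreadable/missing files
  expected_md5 <- known$md5[match(basename(files), known$file)]
  mismatch <- is.na(local_md5) | is.na(expected_md5) | local_md5 != expected_md5
  files[mismatch]
}
```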