SUSTAINABILITY: Maintaining the list of datasets for the Canadian COVID-19 Data Archive #4
Labels
Canadian COVID-19 Data Archive: Issues directly related to the Canadian COVID-19 Data Archive
list of datasets: Maintenance or additions to the list of datasets, including metadata
metadata: Metadata for archived or derived datasets
sustainability: Long-term sustainability of the project
The list of active and inactive datasets in the Canadian COVID-19 Data Archive, along with all associated metadata, is given in datasets.json. It has hundreds of entries.

This is also the data format used by archivist and Covid19CanadaArchive to produce the nightly automated data updates (#2). It is also used to keep the COVID-19 Canada Open Data Working Group datasets updated (see Covid19CanadaETL, Covid19Canada and CovidTimelineCanada). Each dataset is identified by a unique version 4 UUID.

This list is maintained manually by the maintainer (me) based on personal knowledge of Canadian COVID-19 datasets as well as tips from data users in the form of personal communications or GitHub issues. Naturally, this is work-intensive, and it is not always obvious when a new dataset becomes available or an old dataset has been retired, which can lead to loss of the historical record.
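As an illustration of the version 4 UUID identifiers mentioned above, here is a minimal sketch of how a new entry could be given its ID. The function name and the "uuid", "url" and "name" fields are hypothetical placeholders, not the actual datasets.json schema.

```python
import uuid


def new_dataset_entry(url: str, name: str) -> dict:
    """Build a hypothetical metadata record with a fresh version 4 UUID.

    The field names here are assumptions for illustration only; the real
    entries in datasets.json may be structured differently.
    """
    return {
        "uuid": str(uuid.uuid4()),  # random (version 4) UUID
        "url": url,
        "name": name,
    }


entry = new_dataset_entry("https://example.com/cases.csv", "example-cases")
```

Because version 4 UUIDs are random, an ID can be assigned by whoever adds a dataset, without first consulting the existing list.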
Main areas of improvement
These are the main areas I see for improving the sustainability of the dataset list maintenance:
- utils.py currently contains two commonly used functions: retire_dataset, which moves a dataset from "active" to "inactive" in the list of datasets (datasets.json), and list_inactive_datasets, which creates a list of datasets that have produced identical files for a certain number of days, suggesting the dataset may no longer be updated and can be safely moved to "inactive" status.
- Changes to the list of datasets (datasets.json) must be validated before they are accepted, so as not to disrupt tools that rely on the list of datasets.
- Any changes to the list of datasets (datasets.json) would have to be made compatible with the existing tools that use this file, such as the nightly archive update process and Covid19CanadaETL.
- Perhaps datasets.json could be converted to some existing format/standard for this sort of data?

It would be helpful to find precedents for a community-maintained dataset archive/scraping list.
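To make the first two areas above concrete, here is a rough sketch of retiring a dataset, flagging datasets whose files have stopped changing, and validating the list before a change is accepted. It is not the code in utils.py. The function names (retire, find_stale, validate), the flat {"active": [...], "inactive": [...]} layout, the "uuid" field and the per-day file hashes are all assumptions made for illustration.

```python
import uuid
from collections import Counter


def validate(datasets: dict) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    for status in ("active", "inactive"):
        if not isinstance(datasets.get(status), list):
            problems.append(f"missing or non-list section: {status}")
    if problems:
        return problems
    entries = datasets["active"] + datasets["inactive"]
    for i, entry in enumerate(entries):
        try:
            if uuid.UUID(str(entry.get("uuid"))).version != 4:
                problems.append(f"entry {i}: uuid is not version 4")
        except ValueError:
            problems.append(f"entry {i}: uuid is missing or malformed")
    counts = Counter(e.get("uuid") for e in entries)
    problems += [
        f"duplicate uuid: {u}"
        for u, n in counts.items()
        if n > 1 and u is not None
    ]
    return problems


def retire(datasets: dict, dataset_uuid: str) -> None:
    """Move one entry from "active" to "inactive", modifying the dict in place."""
    for i, entry in enumerate(datasets["active"]):
        if entry["uuid"] == dataset_uuid:
            datasets["inactive"].append(datasets["active"].pop(i))
            return
    raise KeyError(f"no active dataset with uuid {dataset_uuid}")


def find_stale(daily_hashes: dict[str, list[str]], days: int) -> list[str]:
    """Return UUIDs whose last `days` daily file hashes are all identical.

    `daily_hashes` maps a dataset UUID to its file hashes, oldest first.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    return [
        u
        for u, hashes in daily_hashes.items()
        if len(hashes) >= days and len(set(hashes[-days:])) == 1
    ]


# Usage: dataset `a` has not changed for three days, dataset `b` has.
a, b = str(uuid.uuid4()), str(uuid.uuid4())
datasets = {"active": [{"uuid": a}, {"uuid": b}], "inactive": []}
stale = find_stale({a: ["h1", "h1", "h1"], b: ["h1", "h2", "h3"]}, days=3)
for u in stale:
    retire(datasets, u)
```

A check like validate could run automatically on every proposed change to datasets.json, so that community contributions are rejected before they reach the nightly archive update. Unchanged files only suggest that a dataset has been retired, so a flagged dataset would still warrant a manual look before it is moved.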