SUSTAINABILITY: Maintaining the list of datasets for the Canadian COVID-19 Data Archive #4

Open
jeanpaulrsoucy opened this issue Feb 22, 2022 · 1 comment
Labels
Canadian COVID-19 Data Archive Issues directly related to the Canadian COVID-19 Data Archive list of datasets Maintenance or additions to the list of datasets, including metadata metadata Metadata for archived or derived datasets sustainability Long-term sustainability of the project

Comments

@jeanpaulrsoucy
Member

jeanpaulrsoucy commented Feb 22, 2022

The list of active and inactive datasets in the Canadian COVID-19 Data Archive, along with all associated metadata, is given in datasets.json. It has hundreds of entries.

This file is also the data format used by archivist and Covid19CanadaArchive to produce the nightly automated data updates (#2). It is also used to keep the COVID-19 Canada Open Data Working Group datasets updated (see Covid19CanadaETL, Covid19Canada and CovidTimelineCanada). Each dataset is identified by a unique version 4 UUID.
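As a rough illustration of the identifier scheme, here is a minimal Python sketch that generates a version 4 UUID for a new entry and checks whether an existing string is one. The helper names (new_dataset_id, is_uuid4) are hypothetical and are not part of archivist or utils.py; only the standard-library uuid module is used.

```python
import uuid


def new_dataset_id():
    """Return a fresh random (version 4) UUID as a string."""
    return str(uuid.uuid4())


def is_uuid4(value):
    """Return True if `value` parses as a version 4 UUID, else False."""
    try:
        return uuid.UUID(value).version == 4
    except (ValueError, TypeError, AttributeError):
        # Malformed string, None, or a non-string value
        return False
```

A check like is_uuid4 could be one of the tests applied to a new entry before it is added to the list.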

This list is maintained manually by the maintainer (me) based on personal knowledge of Canadian COVID-19 datasets as well as tips from data users in the form of personal communications or GitHub issues. Naturally, this is work-intensive and it is not always obvious when a new dataset is available or an old dataset has been retired, leading to (potential) loss of the historical record.

Main areas of improvement

These are the main areas I see for improving the sustainability of dataset list maintenance:

  • Involve multiple users
    • Perhaps each region could have an assigned "steward" responsible for keeping relevant datasets up-to-date (e.g., one person for Ontario, one person for Quebec, one person for PHAC, one person for Atlantic Canada, etc.)
  • Expand on automation tools to assist with maintenance
    • utils.py currently contains two commonly used functions: retire_dataset, which moves a dataset from "active" to "inactive" in the list of datasets (datasets.json), and list_inactive_datasets, which lists datasets that have produced identical files for a certain number of days, suggesting the dataset may no longer be updated and can be safely moved to "inactive" status
  • Create web-based interface for editing
    • It may be easier to collaborate if a fool-proof web-based editing interface is created
    • Changes to the underlying file (e.g., datasets.json) must be validated before they are accepted in order to not disrupt tools that rely on the list of datasets
  • Any changes to the underlying file format (e.g., datasets.json) would have to be made compatible with the existing tools that use this file, such as the nightly archive update process and Covid19CanadaETL
    • Is it possible that datasets.json could be converted to some existing format/standard for this sort of data?
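
To make the validation idea above concrete, here is a minimal Python sketch of a check that could run before an edit to the dataset list is accepted. It assumes a simplified layout that may not match the real datasets.json: a top-level object with "active" and "inactive" keys, each holding a flat list of entries with a "uuid" field. The function name validate_datasets is hypothetical and is not part of utils.py.

```python
import uuid


def validate_datasets(data):
    """Return a list of error messages; an empty list means the checks passed.

    Assumed (simplified) layout, which may differ from the real file:
    {"active": [{"uuid": "..."}, ...], "inactive": [{"uuid": "..."}, ...]}
    """
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]
    errors = []
    seen = set()
    for status in ("active", "inactive"):
        entries = data.get(status)
        if not isinstance(entries, list):
            errors.append(f"missing or malformed '{status}' list")
            continue
        for i, entry in enumerate(entries):
            ds_id = entry.get("uuid") if isinstance(entry, dict) else None
            try:
                valid = uuid.UUID(ds_id).version == 4
            except (ValueError, TypeError, AttributeError):
                valid = False
            if not valid:
                errors.append(f"{status}[{i}]: missing or invalid version 4 UUID")
            elif ds_id in seen:
                # The same identifier must not appear twice, in either list
                errors.append(f"{status}[{i}]: duplicate UUID {ds_id}")
            else:
                seen.add(ds_id)
    return errors
```

A check of this kind could be run on the parsed file (for example, the result of json.load) as part of a pull request or behind a web form, so that a malformed edit is rejected before it reaches the nightly update process.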

It would be helpful to find precedents for a community-maintained dataset archive/scraping list.

@jeanpaulrsoucy jeanpaulrsoucy added sustainability Long-term sustainability of the project Canadian COVID-19 Data Archive Issues directly related to the Canadian COVID-19 Data Archive list of datasets Maintenance or additions to the list of datasets, including metadata metadata Metadata for archived or derived datasets labels Feb 22, 2022
@jeanpaulrsoucy
Member Author

jeanpaulrsoucy commented Feb 26, 2022

A precedent for community-maintained data scraping/scrapers: the Police Data Accessibility Project. They even have some kind of Python GUI for helping users write scrapers (note: I haven't checked this out yet).
