This repository has been archived by the owner on Jan 12, 2024. It is now read-only.

Automate generation of EPA CEMS metadata for data catalog export #2

Open
8 tasks
Tracked by #1564
zaneselvans opened this issue Apr 6, 2022 · 0 comments
Labels
  • epacems: The EPA's Continuous Emissions Monitoring System hourly dataset
  • inframundo
  • intake: Intake data catalogs
  • metadata: Data about our liberated data

Comments

@zaneselvans
Member

We want to integrate column and table metadata (e.g. text descriptions) into the source definition in pudl_catalog.yaml so that users can understand what data is available when browsing the catalog. This information is currently written into the column and table metadata within the Parquet files during ETL, so it could be read from there. Alternatively, it could be exported directly from our Pydantic metadata models when we generate pudl_catalog.yaml.

  • Identify or create an appropriate structure / format for table & column level metadata in the pudl_catalog.yaml. This should include at least:
    • Text description of the table (Resource.description)
    • Primary key of the table (Resource.schema.primary_key)
    • Text descriptions for each column (Field.description)
    • Licensing terms for the data (Resource.license)
    • The original source of the data (Resource.sources)
    • Creator(s) / Maintainer(s) of the dataset (Resource.contributors)
  • Add a Resource.to_intake_data_source() method that can generate the Intake data source level metadata entry.
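
As a rough starting point for discussion, here is a minimal sketch of what such a method might look like. It uses plain dataclasses as stand-ins for the real Pydantic models, and everything about the output layout is an assumption: the nesting under `metadata`, the key names `primary_key` and `columns`, the `urlpath` parameter, and the `parquet` driver name are illustrative, not a settled format. The example table, column names, and placeholder values are hypothetical too.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


# Simplified stand-ins for the Pydantic metadata models (assumption: the real
# classes have more fields and validation than shown here).
@dataclass
class Field:
    name: str
    description: str


@dataclass
class Schema:
    fields: list[Field]
    primary_key: list[str]


@dataclass
class Resource:
    name: str
    description: str
    schema: Schema
    license: dict[str, str]
    sources: list[dict[str, str]]
    contributors: list[dict[str, str]]

    def to_intake_data_source(self, urlpath: str) -> dict[str, Any]:
        """Build one catalog source entry as a plain dict (hypothetical layout)."""
        return {
            self.name: {
                "description": self.description,
                "driver": "parquet",  # assumed driver name
                "args": {"urlpath": urlpath},
                "metadata": {
                    "primary_key": list(self.schema.primary_key),
                    # Map each column name to its text description.
                    "columns": {f.name: f.description for f in self.schema.fields},
                    "license": self.license,
                    "sources": self.sources,
                    "contributors": self.contributors,
                },
            }
        }


# Illustrative values only; none of these are the real PUDL metadata.
example = Resource(
    name="example_table",
    description="Placeholder table description.",
    schema=Schema(
        fields=[
            Field("plant_id", "Placeholder description of the plant ID column."),
            Field("timestamp", "Placeholder description of the timestamp column."),
        ],
        primary_key=["plant_id", "timestamp"],
    ),
    license={"name": "placeholder-license"},
    sources=[{"title": "placeholder source"}],
    contributors=[{"title": "placeholder contributor"}],
)

entry = example.to_intake_data_source("path/to/example_table.parquet")
```

The resulting dict could then be merged under the `sources` key when pudl_catalog.yaml is generated. Keeping the method's output a plain dict leaves the choice of YAML serializer to the catalog generation step.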
@zaneselvans added the intake, epacems, and metadata labels on Apr 6, 2022
@jdangerx moved this to 🆕 New in Catalyst Megaproject on Feb 7, 2023
@jdangerx moved this from 🆕 New to 📋 Backlog in Catalyst Megaproject on Feb 7, 2023
Projects
Status: Icebox
Development

No branches or pull requests

2 participants