Mtiemann os climate rmi cleanup #72

Merged
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -21,13 +21,13 @@ JSON source that can be ingested by the ITR tool. Presently 14 sectors are suppo
- Chemical Industry
- Textiles and Leather

A logical (and welcomed) next step would be to curate this data within our Trino database (with proper metadata descriptions).

The notebook [ITR-data-production](ITR-data-production.ipynb) synthesizes a set of corporate data from a variety of public sources, including [GLEIF](https://www.gleif.org/en) legal entity
identifiers, [SEC financial disclosures](https://www.sec.gov/edgar/searchedgar/companysearch), [US Census data](https://www.census.gov/data.html), [RMI-curated production
data](https://utilitytransitionhub.rmi.org/data-download/), and some hand-curated sources as well.

Most importantly, this pipeline puts the financial, production, emissions, and other data into Trino so that the ITR can access it via the [Data Commons](https://github.com/os-climate/os_c_data_commons).

A logical (and welcomed) next step would be to curate this data within our Trino database (with proper metadata descriptions for all data, not just RMI Utility Transition Hub data).

If you have questions, please file [Issues](https://github.com/os-climate/itr-data-pipeline/issues). If you have answers, please contribute [Pull
Requests](https://github.com/os-climate/itr-data-pipeline/pulls)!
Binary file added data/processed/template-20220415-output.xlsx
Binary file not shown.
29 changes: 29 additions & 0 deletions dbt/rmi_transform/README.md
@@ -0,0 +1,29 @@
# RMI Utility Transition Hub Ingestion Pipeline

This pipeline processes [data published by the RMI Utility Transition Data Hub](https://utilitytransitionhub.rmi.org/data-download/) team. It aligns named corporate entities with [Global Legal Entity Identifiers](https://www.gleif.org/en), performs some minor data cleaning, and adds metadata.

If you have questions, please file [Issues](https://github.com/os-climate/rmi-utility-transition-hub-ingestion-pipeline/issues). If you have answers, please contribute [Pull Requests](https://github.com/os-climate/rmi-utility-transition-hub-ingestion-pipeline/pulls)!

The principal ingestion code can be found in the [notebooks](notebooks) directory. At present there are two steps in the pipeline:

1. Extract and Load, which loads data into Trino, builds the dbt transforms, and initializes metadata for OpenMetadata. We do not use `Pachyderm` at the moment because its dependency pins have fallen behind and conflict with recent (since May 2022) versions of `dbt`, which holds back other progress we want to make.
2. dbt data transformation (documented here).
   a. Remember to connect dbt with `profiles.yml` (which defaults to `~/.dbt/profiles.yml`).
   b. From the CLI, `dbt run --profiles-dir=XYZZY` reads `profiles.yml` from the directory `XYZZY`.
   c. `dbt test --profiles-dir=XYZZY` currently does nothing.
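The dbt invocations in step 2 can also be driven from a notebook or script. The following is a minimal sketch, not part of this repository: the helper names `dbt_command` and `has_profile` are hypothetical, and `XYZZY` stands in for whatever directory actually holds your `profiles.yml`.

```python
from pathlib import Path
from typing import List, Optional


def dbt_command(subcommand: str, profiles_dir: Optional[str] = None) -> List[str]:
    """Build the argument list for a dbt invocation such as `dbt run`.

    `dbt_command` is a hypothetical helper, not something dbt provides.
    """
    args = ["dbt", subcommand]
    if profiles_dir is not None:
        # --profiles-dir tells dbt which directory contains profiles.yml.
        args += ["--profiles-dir", profiles_dir]
    return args


def has_profile(profiles_dir: str) -> bool:
    """Return True if a profiles.yml file exists in the given directory."""
    return (Path(profiles_dir) / "profiles.yml").is_file()
```

A caller could check `has_profile("XYZZY")` first and then pass `dbt_command("run", "XYZZY")` to `subprocess.run(..., check=True)` from inside the dbt project directory.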

Remember also that Jupyter notebooks create checkpoint files, which disturb `dbt` if they appear within the dbt folder hierarchy.
The best work-around is to run the Jupyter notebook environment like so: `jupyter lab --FileContentsManager.checkpoints_kwargs="root_dir"="/tmp"`.
A chaotic-neutral work-around is to frequently and liberally execute `find ../.. -name \*checkpoint\* -exec rm -rf {} \;`.
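If you prefer a narrower cleanup than the `find` command, a sketch along the following lines may help. It assumes Jupyter's default checkpoint directory name, `.ipynb_checkpoints`, and removes only directories with exactly that name, whereas the `find` pattern matches anything containing `checkpoint`. The function name `remove_checkpoints` is hypothetical.

```python
import shutil
from pathlib import Path
from typing import List


def remove_checkpoints(root: str) -> List[Path]:
    """Delete every `.ipynb_checkpoints` directory under `root`.

    Returns the list of directories that were removed.
    Assumes Jupyter's default checkpoint directory name.
    """
    removed = []
    # Collect the matches before deleting so the tree is not modified mid-walk.
    for path in sorted(Path(root).rglob(".ipynb_checkpoints")):
        # A nested match may already be gone if its parent was removed.
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(path)
    return removed
```

Calling `remove_checkpoints("../..")` from the notebooks directory would cover the same part of the tree as the `find` command above.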

## Resources

- Learn more about dbt [in the docs](https://docs.getdbt.com/docs/introduction)
- Check out [Discourse](https://discourse.getdbt.com/) for commonly asked questions and answers
- Join the [chat](https://community.getdbt.com/) on Slack for live discussions and support
- Find [dbt events](https://events.getdbt.com) near you
- Check out [the blog](https://blog.getdbt.com/) for the latest news on dbt's development and best practices

Metadata for the tables we have ingested can be viewed from our [OpenMetadata portal](https://openmetadata-openmetadata.apps.odh-cl2.apps.os-climate.org/explore/tables/?searchFilter=databaseschema%3Drmi) (GitHub User ID and ODH User access tokens required).

If you have questions, please file [Issues](https://github.com/os-climate/itr-data-pipeline/issues). If you have answers, please contribute [Pull Requests](https://github.com/os-climate/itr-data-pipeline/pulls)!
37 changes: 37 additions & 0 deletions dbt/rmi_transform/dbt_project.yml
@@ -0,0 +1,37 @@

# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'rmi_transform'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'rmi_transform'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target" # directory which will store compiled SQL files
clean-targets: # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"


# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

# In this config, we tell dbt to build all models in the rmi_transform project
# as views with invoker security. These settings can be overridden in the
# individual model files using the `{{ config(...) }}` macro.
models:
  rmi_transform:
    materialized: view
    +view_security: invoker
6 changes: 6 additions & 0 deletions dbt/rmi_transform/models/assets_earnings_investments.sql
@@ -0,0 +1,6 @@
{{ config(materialized='view', view_security='invoker') }}
with source_data as (
select parent_name, utility_name, respondent_id, year, asset, sub_asset, asset_value, earnings_value, investment_value
from osc_datacommons_dev.rmi.assets_earnings_investments_source
)
select * from source_data
6 changes: 6 additions & 0 deletions dbt/rmi_transform/models/customers_sales.sql
@@ -0,0 +1,6 @@
{{ config(materialized='view', view_security='invoker') }}
with source_data as (
select parent_name, utility_name, respondent_id, year, customer_type, customer_type_rmi, customers, sales, revenues
from osc_datacommons_dev.rmi.customers_sales_source
)
select * from source_data
6 changes: 6 additions & 0 deletions dbt/rmi_transform/models/debt_equity_returns.sql
@@ -0,0 +1,6 @@
{{ config(materialized='view', view_security='invoker') }}
with source_data as (
select parent_name, utility_name, respondent_id, year, rate_base_actual, equity_actual, debt_actual, equity_ratio_actual, returns_actual, earnings_actual, interest_actual, fed_tax_expense_actual, pre_tax_net_income_actual, ror_actual, roe_actual, interest_rate_actual, equity_ratio, ror, roe, interest_rate, effective_fed_tax_rate, equity_authorized, debt_authorized, returns_authorized, earnings_authorized, interest_authorized, interest_rate_authorized
from osc_datacommons_dev.rmi.debt_equity_returns_source
)
select * from source_data
6 changes: 6 additions & 0 deletions dbt/rmi_transform/models/emissions_targets.sql
@@ -0,0 +1,6 @@
{{ config(materialized='view', view_security='invoker') }}
with source_data as (
select parent_name, utility_name, respondent_id, year, target_scope, target_type, state, co2_historical, co2_target, co2_target_all_years, co2_1point5c, generation_historical, generation_projected, generation_1point5c, co2_intensity_historical, co2_intensity_target, co2_intensity_target_all_years, co2_intensity_1point5c
from osc_datacommons_dev.rmi.emissions_targets_source
)
select * from source_data
6 changes: 6 additions & 0 deletions dbt/rmi_transform/models/employees.sql
@@ -0,0 +1,6 @@
{{ config(materialized='view', view_security='invoker') }}
with source_data as (
select parent_name, utility_name, respondent_id, year, technology, employees
from osc_datacommons_dev.rmi.employees_source
)
select * from source_data
6 changes: 6 additions & 0 deletions dbt/rmi_transform/models/expenditure_bills_burden.sql
@@ -0,0 +1,6 @@
{{ config(materialized='view', view_security='invoker') }}
with source_data as (
select parent_name, utility_name, respondent_id, year, percent_ami, ownership, electricity_gas_other, technology, expenditure, bill, burden
from osc_datacommons_dev.rmi.expenditure_bills_burden_source
)
select * from source_data
6 changes: 6 additions & 0 deletions dbt/rmi_transform/models/housing_units_income.sql
@@ -0,0 +1,6 @@
{{ config(materialized='view', view_security='invoker') }}
with source_data as (
select parent_name, utility_name, respondent_id, year, percent_ami, ownership, housing_units, income
from osc_datacommons_dev.rmi.housing_units_income_source
)
select * from source_data
6 changes: 6 additions & 0 deletions dbt/rmi_transform/models/net_plant_balance.sql
@@ -0,0 +1,6 @@
{{ config(materialized='view', view_security='invoker') }}
with source_data as (
select parent_name, utility_name, respondent_id, year, ferc_class, original_cost, accum_depr, net_plant_balance, arc, arc_accum_depr, net_arc
from osc_datacommons_dev.rmi.net_plant_balance_source
)
select * from source_data
6 changes: 6 additions & 0 deletions dbt/rmi_transform/models/operations_emissions_by_fuel.sql
@@ -0,0 +1,6 @@
{{ config(materialized='view', view_security='invoker') }}
with source_data as (
select year, parent_name, utility_name, utility_id_eia, utility_type_rmi, plant_id_eia, plant_name_eia, generator_id, state, city, county, latitude, longitude, balancing_authority_code_eia, balancing_authority_name_eia, iso_rto_code, nerc_region, operational_status_code, operating_month, operating_year, retirement_month, retirement_year, energy_source, owned_energy_source, technology_eia, technology_rmi, energy_source_code, fuel_type_category, net_generation, fuel_consumed, emissions_co2, emissions_nox, emissions_sox
from osc_datacommons_dev.rmi.operations_emissions_by_fuel_source
)
select * from source_data
6 changes: 6 additions & 0 deletions dbt/rmi_transform/models/operations_emissions_by_tech.sql
@@ -0,0 +1,6 @@
{{ config(materialized='view', view_security='invoker') }}
with source_data as (
select year, parent_name, utility_name, utility_id_eia, utility_type_rmi, plant_id_eia, plant_name_eia, generator_id, state, city, county, latitude, longitude, balancing_authority_code_eia, balancing_authority_name_eia, iso_rto_code, nerc_region, operational_status_code, operating_month, operating_year, retirement_month, retirement_year, energy_source, owned_energy_source, technology_eia, technology_rmi, capacity, year_end_capacity, net_generation, potential_generation, capacity_factor, fuel_consumed, emissions_co2, emissions_nox, emissions_sox
from osc_datacommons_dev.rmi.operations_emissions_by_tech_source
)
select * from source_data
6 changes: 6 additions & 0 deletions dbt/rmi_transform/models/revenue_by_tech.sql
@@ -0,0 +1,6 @@
{{ config(materialized='view', view_security='invoker') }}
with source_data as (
select parent_name, utility_name, respondent_id, year, technology, component, detail, revenue_total, revenue_residential
from osc_datacommons_dev.rmi.revenue_by_tech_source
)
select * from source_data
6 changes: 6 additions & 0 deletions dbt/rmi_transform/models/state_policies.sql
@@ -0,0 +1,6 @@
{{ config(materialized='view', view_security='invoker') }}
with source_data as (
select state, state_abbr, securitization_policy, market_indexing_policy, fuel_pass_through, governor_party, legislation_majority_party
from osc_datacommons_dev.rmi.state_policies_source
)
select * from source_data
6 changes: 6 additions & 0 deletions dbt/rmi_transform/models/state_targets.sql
@@ -0,0 +1,6 @@
{{ config(materialized='view', view_security='invoker') }}
with source_data as (
select state, year, year_type, legal_standard, enforcement_standard, target_type, target_value
from osc_datacommons_dev.rmi.state_targets_source
)
select * from source_data
6 changes: 6 additions & 0 deletions dbt/rmi_transform/models/utility_information.sql
@@ -0,0 +1,6 @@
{{ config(materialized='view', view_security='invoker') }}
with source_data as (
select parent_name, parent_lei, parent_ticker, parent_isin, utility_name, respondent_id, entity_id, utility_id_eia, utility_lei, entity_type_eia, utility_type_rmi, first_report_year, last_report_year, duplicate_utility_id_eia
from osc_datacommons_dev.rmi.utility_information_source
)
select * from source_data
6 changes: 6 additions & 0 deletions dbt/rmi_transform/models/utility_information_2023.sql
@@ -0,0 +1,6 @@
{{ config(materialized='view', view_security='invoker') }}
with source_data as (
select parent_name, parent_lei, ticker, isin, utility_name, utility_id_ferc1, utility_id_ferc1_dbf, utility_id_ferc1_xbrl, utility_id_eia, utility_lei, fraction_owned_utility, entity_type_eia, utility_type_rmi, public_private_unmapped, duplicate_utility_id_eia
from osc_datacommons_dev.rmi.utility_information_2023_source
)
select * from source_data
6 changes: 6 additions & 0 deletions dbt/rmi_transform/models/utility_state_map.sql
@@ -0,0 +1,6 @@
{{ config(materialized='view', view_security='invoker') }}
with source_data as (
select parent_name, utility_name, respondent_id, year, state, state_abbr, capacity_owned_in_state, capacity_operated_in_state, mwh_sales_in_state
from osc_datacommons_dev.rmi.utility_state_map_source
)
select * from source_data
6 changes: 6 additions & 0 deletions dbt/rmi_transform/models/utility_state_map_2023.sql
@@ -0,0 +1,6 @@
{{ config(materialized='view', view_security='invoker') }}
with source_data as (
select parent_name, utility_name, utility_id_eia, year, state, state_abbr, capacity_owned_in_state, capacity_operated_in_state, mwh_sales_in_state
from osc_datacommons_dev.rmi.utility_state_map_2023_source
)
select * from source_data
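The model files above all share one pass-through shape: a `config` line, a CTE selecting named columns from `osc_datacommons_dev.rmi.<table>_source`, and a final `select *`. As an illustration only, that shape could be rendered from a table name and column list as sketched below. The `render_model` function is hypothetical and is not how this PR generated its files.

```python
from typing import Sequence


def render_model(table: str, columns: Sequence[str],
                 schema: str = "osc_datacommons_dev.rmi") -> str:
    """Render a pass-through dbt view model over `<schema>.<table>_source`.

    Hypothetical helper that mirrors the layout of the model files in this PR.
    """
    column_list = ", ".join(columns)
    return "\n".join([
        # Plain (non-f) string, so the Jinja braces are emitted literally.
        "{{ config(materialized='view', view_security='invoker') }}",
        "with source_data as (",
        f"select {column_list}",
        f"from {schema}.{table}_source",
        ")",
        "select * from source_data",
    ])
```

For example, `render_model("employees", ["parent_name", "utility_name", "respondent_id", "year", "technology", "employees"])` yields the same six lines as `models/employees.sql`.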