Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue #2590] Replace gh in analytics ETL #3393

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

widal001
Copy link
Collaborator

@widal001 widal001 commented Jan 5, 2025

Summary

Replaces the sub-process call to the gh CLI by replacing it with a GitHubGraphqlClient class that can make calls to the GitHub GraphQL library directly from python.

Fixes #2590

Time to review: 10 mins

Changes proposed

What was added, updated, or removed in this PR.

  • Adds a GitHubGraphqlClient class that can make paginated calls to the GitHub GraphQL API
  • Replaces the calls to a separate sub-process in src/analytics/etl/github/main.py with the GitHubGraphqlClient
  • Removes the make-graphql-call.sh script that previously invoked the gh CLI
  • Removes the `

Context for reviewers

Testing instructions, background context, more in-depth details of the implementation, and anything else you'd like to call out or ask reviewers. Explain how the changes were verified.

Instructions to test

  1. Checkout the PR locally
  2. Re-build the docker image: make build
  3. Run the GitHub data export and sprint reports end-to-end: make sprint-reports-with-latest-data

Notes

We'll want to refactor the src/analytics/integrations/github/ sub-package a little bit further pulling most of the code in the main.py file in that sub-package into src/analytics/etl/github.py instead.

I didn't include that in this PR to try to minimize the amount of code I was changing, but we can/should tackle that refactor in #3203 because some of the functions in main.py still write to the local file system, but can easily be updated to pass the exported data as a python dictionary.

Additional information

Screenshots, GIF demos, code examples or output to help show the changes working as expected.

The local run of sprint reports with the new code matches the output of the last run triggered by AWS step functions (using code in main) posted to slack:

Sprint report for HHS/13

In Slack (based on main)

Screenshot 2025-01-04 at 7 46 28 PM

Locally, based on this feature branch:

Screenshot 2025-01-04 at 7 47 22 PM

Sprint burndown for HHS/17

In Slack (based on main)

Screenshot 2025-01-04 at 7 48 06 PM

Locally, based on this feature branch:

Screenshot 2025-01-04 at 7 47 50 PM

Deliverable percent complete

In Slack (based on main)

Screenshot 2025-01-04 at 7 49 02 PM

Locally, based on this feature branch:

Screenshot 2025-01-04 at 7 48 45 PM

@widal001 widal001 requested review from DavidDudas-Intuitial and coilysiren and removed request for acouch and coilysiren January 5, 2025 00:50
@@ -18,7 +18,6 @@ RUN apt-get update \
libpq-dev \
postgresql \
wget \
jq \
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing jq because we no longer need it for transformations

Comment on lines -37 to -38
# Install gh CLI
# docs: https://github.com/cli/cli/blob/trunk/docs/install_linux.md
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this script because we no longer need the gh CLI

@@ -19,6 +19,7 @@ class DBSettings(PydanticBaseEnvConfig):
ssl_mode: str = Field("require", alias="DB_SSL_MODE")
db_schema: str = Field ("app", alias="DB_SCHEMA")
slack_bot_token: str = Field(alias="ANALYTICS_SLACK_BOT_TOKEN")
github_token: str = Field(alias="GH_TOKEN")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this because we now need to reference it directly within the codebase, instead of indirectly like we did previously with the gh CLI

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are in this file, can we rename DBSettings to something more accurate

###########################
# Do not add these values to this file
# to avoid mistakenly committing them.
# Set these in your shell
# by doing `export ANALYTICS_REPORTING_CHANNEL_ID=whatever`
ANALYTICS_REPORTING_CHANNEL_ID=DO_NOT_SET_HERE
ANALYTICS_SLACK_BOT_TOKEN=DO_NOT_SET_HERE
GH_TOKEN=DO_NOT_SET_HERE
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prevents tests from failing if someone hasn't set their GitHub token locally.

Comment on lines -71 to -72
"ANN101", # missing type annotation for self
"ANN102", # missing type annotation for cls
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed these because they've been removed in the latest version of ruff

@@ -78,7 +76,6 @@ ignore = [
"PTH123", # `open()` should be replaced by `Path.open()`
"RUF012", # Mutable class attributes should be annotated with `typing.ClassVar`
"TD003", # missing an issue link on TODO
"PT004", # pytest fixture leading underscore - is marked deprecated
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with this one

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This files is basically a complete refactor, but preserves the existing helper functions for the export to prevent this PR from getting bigger than it already is.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removes this because we no longer need it

@@ -40,7 +40,7 @@ def test_init(
records = caplog.records
assert len(records) == 2
assert re.match(
r"^start test_logging: \w+ [0-9.]+ \w+, hostname \S+, pid \d+, user \d+\(\w+\)$",
r"^start test_logging: \w+ [0-9.]+ \w+, hostname \S+, pid \d+, user \d+\([\w\.]+\)",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this because the tests were failing locally if there was a period in the username, e.g. billy.daly

Copy link
Collaborator

@coilysiren coilysiren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have (1) significant question about the data formatting, everything else looks fine

analytics/local.env Outdated Show resolved Hide resolved
{
"project_owner": owner,
"project_number": project,
"issue_title": safe_pluck(item, "content.title"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need safe_pluck. If there's a bunch of fields missing, I would rather the code raise a keyerror, instead of getting us bad (eg. mostly null) data.

Copy link
Collaborator Author

@widal001 widal001 Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see the concern, but it's technically valid for all of these attributes to be empty in GitHub, except issue_title, issue_url and issue_opened_at. For example this issue has issue_type and issue_status but everything else is blank (e.g. sprint, parent, points, etc.)

We're currently validating the output data using the IssueMetada pydantic class when we parse these items in this step

We could have the non-nullable fields fail with a KeyError at this step, but the pydantic validation gives us better debugging output and allows us to gracefully continue with exporting and transforming the rest of the issues.

Copy link
Collaborator Author

@widal001 widal001 Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to be more strict with what we consider "valid" data, I could see us requiring issue_type and issue_status as well.

Although since the logs are only retained for a limited amount of time, having issues without a type or status get "silently" dropped is often less helpful than having them with null data in Metabase.

The broader strategy around data quality and effectively handling a "dead letter queue" of bad data is the subject of this epic on data quality checks.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd definitely welcome your thoughts, though, on other potential strategies here as an intermediate step to implementing more robust data quality checks!

I think the basic things we're trying to achieve in the transform step are:

  • Prevent "bad" data from being inserted into the database (i.e. data that is missing required columns, or data that is missing optional columns because of a bug in the ETL -- the latter one is harder to check for)
  • Support inserts of data that are missing optional columns, when they are valid
  • Prevent failures of a subset of data from blocking loads of the remaining valid data

Typically I've achieved these goals by using tools like Great Expectations or Anomalo which run on the entire data set to check for quality issues or anomalies, but there might be immediate steps we can take right now to block bad data or catch more programming errors upfront.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are already using Pydantic, we can use it's nested models feature:

https://stackoverflow.com/questions/70302056/define-a-pydantic-nested-model

Then drop this translation and safe_pluck layer entirely, and rely entirely on Pydantic to do the validation and null data transformation

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @chouinar I would be interested in your thoughts on my "rely entirely on Pydantic" assertion here. Going to request changes since I feel like this is in fact blocking, would rather do it right than do it fast.

Copy link
Collaborator

@DavidDudas-Intuitial DavidDudas-Intuitial left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice work, especially on the tests

{
"project_owner": owner,
"project_number": project,
"issue_title": safe_pluck(item, "content.title"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @chouinar I would be interested in your thoughts on my "rely entirely on Pydantic" assertion here. Going to request changes since I feel like this is in fact blocking, would rather do it right than do it fast.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Call GitHub graphql API directly from python
3 participants