[Issue #2590] Replace `gh` in analytics ETL #3393
base: main
This test used to fail if the username contained a dot (e.g. `first.last`). This commit adjusts the regex to allow usernames with dots.
Adds a class to make calls to the GitHub GraphQL API to replace the `gh` CLI
To analytics.integrations.github.client
After the refactor, we no longer need them
@@ -18,7 +18,6 @@ RUN apt-get update \
libpq-dev \
postgresql \
wget \
jq \
Removing `jq` because we no longer need it for transformations.
# Install gh CLI
# docs: https://github.com/cli/cli/blob/trunk/docs/install_linux.md
Removing this script because we no longer need the `gh` CLI.
@@ -19,6 +19,7 @@ class DBSettings(PydanticBaseEnvConfig):
ssl_mode: str = Field("require", alias="DB_SSL_MODE")
db_schema: str = Field ("app", alias="DB_SCHEMA")
slack_bot_token: str = Field(alias="ANALYTICS_SLACK_BOT_TOKEN")
github_token: str = Field(alias="GH_TOKEN")
Added this because we now need to reference it directly within the codebase, instead of indirectly like we did previously with the `gh` CLI.
Since we are in this file, can we rename `DBSettings` to something more accurate?
###########################
# Do not add these values to this file
# to avoid mistakenly committing them.
# Set these in your shell
# by doing `export ANALYTICS_REPORTING_CHANNEL_ID=whatever`
ANALYTICS_REPORTING_CHANNEL_ID=DO_NOT_SET_HERE
ANALYTICS_SLACK_BOT_TOKEN=DO_NOT_SET_HERE
GH_TOKEN=DO_NOT_SET_HERE
Prevents tests from failing if someone hasn't set their GitHub token locally.
"ANN101", # missing type annotation for self
"ANN102", # missing type annotation for cls
Removed these because they've been removed in the latest version of ruff
@@ -78,7 +76,6 @@ ignore = [
"PTH123", # `open()` should be replaced by `Path.open()`
"RUF012", # Mutable class attributes should be annotated with `typing.ClassVar`
"TD003", # missing an issue link on TODO
"PT004", # pytest fixture leading underscore - is marked deprecated
Same with this one
This file is basically a complete refactor, but it preserves the existing helper functions for the export to prevent this PR from getting bigger than it already is.
Removes this because we no longer need it
@@ -40,7 +40,7 @@ def test_init(
records = caplog.records
assert len(records) == 2
assert re.match(
r"^start test_logging: \w+ [0-9.]+ \w+, hostname \S+, pid \d+, user \d+\(\w+\)$",
r"^start test_logging: \w+ [0-9.]+ \w+, hostname \S+, pid \d+, user \d+\([\w\.]+\)",
Changed this because the tests were failing locally if there was a period in the username, e.g. billy.daly
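To make the difference between the two patterns concrete, here is a small check using both regexes from the diff above. The log lines are made up for illustration; only the two patterns come from the PR.

```python
import re

# Pattern before the change: the username group only allows word characters.
OLD_PATTERN = r"^start test_logging: \w+ [0-9.]+ \w+, hostname \S+, pid \d+, user \d+\(\w+\)$"
# Pattern after the change: the username group also allows dots.
NEW_PATTERN = r"^start test_logging: \w+ [0-9.]+ \w+, hostname \S+, pid \d+, user \d+\([\w\.]+\)"

# Hypothetical log lines (not taken from a real run).
PLAIN_USER = "start test_logging: CPython 3.12.0 Linux, hostname devbox, pid 4242, user 501(billy)"
DOTTED_USER = "start test_logging: CPython 3.12.0 Linux, hostname devbox, pid 4242, user 501(billy.daly)"


def matches(pattern: str, line: str) -> bool:
    """Return True if the pattern matches from the start of the line."""
    return re.match(pattern, line) is not None
```

With these inputs the old pattern matches only the plain username, while the new one matches both. Note that the new pattern also drops the trailing `$`, so it no longer requires the line to end right after the closing parenthesis.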
I have one significant question about the data formatting; everything else looks fine.
{
"project_owner": owner,
"project_number": project,
"issue_title": safe_pluck(item, "content.title"),
I don't understand why we need `safe_pluck`. If there's a bunch of fields missing, I would rather the code raise a `KeyError` instead of getting us bad (e.g. mostly null) data.
I can see the concern, but it's technically valid for all of these attributes to be empty in GitHub, except `issue_title`, `issue_url`, and `issue_opened_at`. For example, this issue has `issue_type` and `issue_status`, but everything else is blank (e.g. sprint, parent, points, etc.).
We're currently validating the output data using the `IssueMetadata` pydantic class when we parse these items in this step.
We could have the non-nullable fields fail with a `KeyError` at this step, but the pydantic validation gives us better debugging output and allows us to gracefully continue with exporting and transforming the rest of the issues.
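For readers following the thread, here is a rough sketch of what a helper like `safe_pluck` could look like. The real implementation isn't shown in this conversation, so the signature, the `default` parameter, and the sample item below are all assumptions.

```python
from typing import Any


def safe_pluck(data: Any, path: str, default: Any = None) -> Any:
    """Follow a dot-separated path through nested dicts.

    Hypothetical implementation: returns `default` as soon as a key is
    missing or an intermediate value is not a dict, instead of raising.
    """
    current = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up project item: a title is present, the sprint field is empty,
# and there is no "points" key at all.
SAMPLE_ITEM = {"content": {"title": "Fix login bug"}, "sprint": None}
```

Here `safe_pluck(SAMPLE_ITEM, "content.title")` returns the title, while `"sprint.name"` and `"points.value"` both come back as `None`. Plain indexing such as `SAMPLE_ITEM["points"]["value"]` would raise a `KeyError`, which is the stricter behavior suggested earlier in the thread.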
If we want to be more strict with what we consider "valid" data, I could see us requiring `issue_type` and `issue_status` as well.
Although since the logs are only retained for a limited amount of time, having issues without a type or status get "silently" dropped is often less helpful than having them with `null` data in Metabase.
The broader strategy around data quality and effectively handling a "dead letter queue" of bad data is the subject of this epic on data quality checks.
I'd definitely welcome your thoughts, though, on other potential strategies here as an intermediate step to implementing more robust data quality checks!
I think the basic things we're trying to achieve in the transform step are:
- Prevent "bad" data from being inserted into the database (i.e. data that is missing required columns, or data that is missing optional columns because of a bug in the ETL -- the latter one is harder to check for)
- Support inserts of data that are missing optional columns, when they are valid
- Prevent failures of a subset of data from blocking loads of the remaining valid data
Typically I've achieved these goals by using tools like Great Expectations or Anomalo which run on the entire data set to check for quality issues or anomalies, but there might be immediate steps we can take right now to block bad data or catch more programming errors upfront.
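As one possible intermediate step, the three goals listed above could be sketched with a per-record check that sets bad rows aside instead of failing the whole load. This is only an illustration: the function name and the shape of the rejected entries are made up, and the required fields are the three named earlier in this thread.

```python
from typing import Any, Dict, List, Tuple

# Fields described in this thread as never valid when empty.
REQUIRED_FIELDS = ("issue_title", "issue_url", "issue_opened_at")


def split_valid_records(
    records: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Separate loadable records from ones missing a required field.

    Optional fields may be None; a record is only rejected when a
    required field is absent or empty. Rejected records are kept, with
    the reason, so they can be logged or reviewed later.
    """
    valid: List[Dict[str, Any]] = []
    rejected: List[Dict[str, Any]] = []
    for record in records:
        missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
        if missing:
            rejected.append({"record": record, "missing": missing})
        else:
            valid.append(record)
    return valid, rejected
```

One failing record does not block the rest, and the rejected list acts as a very small stand-in for a dead letter queue. It does not catch the harder case mentioned above, where an optional column is empty because of a bug in the ETL.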
Since we are already using Pydantic, we can use its nested models feature:
https://stackoverflow.com/questions/70302056/define-a-pydantic-nested-model
Then drop this translation and `safe_pluck` layer, and rely entirely on Pydantic to do the validation and null data transformation.
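A minimal sketch of the nested-model idea, assuming a simplified item shape. The model and field names here (`IssueContent`, `SprintValue`, `ProjectItem`, `points`) are invented for illustration and do not mirror the real GraphQL response or the existing validation class.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class IssueContent(BaseModel):
    # Required: validation fails if either field is missing.
    title: str
    url: str


class SprintValue(BaseModel):
    # Optional leaf value inside an optional nested object.
    name: Optional[str] = None


class ProjectItem(BaseModel):
    # The nested dict under "content" is parsed into IssueContent.
    content: IssueContent
    # Optional nested objects default to None when absent.
    sprint: Optional[SprintValue] = None
    points: Optional[int] = None
```

Constructing `ProjectItem(**raw_item)` from a raw dict then raises `ValidationError` when `content.title` is missing, and fills `sprint` and `points` with `None` when they are absent, which covers what the plucking layer does today without a separate translation step.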
cc @chouinar I would be interested in your thoughts on my "rely entirely on Pydantic" assertion here. Going to request changes since I feel like this is in fact blocking; I would rather do it right than do it fast.
Co-authored-by: kai [they] <[email protected]>
LGTM. Nice work, especially on the tests
Summary
Replaces the sub-process call to the `gh` CLI with a `GitHubGraphqlClient` class that can make calls to the GitHub GraphQL API directly from Python.
Fixes #2590
Time to review: 10 mins
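For context, cursor-based pagination against the GitHub GraphQL API generally looks like the sketch below. This is not the `GitHubGraphqlClient` from this PR: the class name, the `post_json` helper, the `endCursor` variable name, and the `path_to_connection` argument are all assumptions made for illustration. It relies only on the standard `nodes` / `pageInfo { hasNextPage endCursor }` connection fields and the `https://api.github.com/graphql` endpoint.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, Iterator, Optional, Sequence

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"


def post_json(url: str, token: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Send one GraphQL request over HTTPS and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


class PaginatedGraphqlClient:
    """Hypothetical client that follows pageInfo cursors until exhausted."""

    def __init__(
        self,
        token: str,
        post: Callable[[str, str, Dict[str, Any]], Dict[str, Any]] = post_json,
    ) -> None:
        self.token = token
        self.post = post  # injectable so tests can avoid the network

    def fetch_all(
        self,
        query: str,
        variables: Dict[str, Any],
        path_to_connection: Sequence[str],
    ) -> Iterator[Dict[str, Any]]:
        """Yield every node of the connection found at path_to_connection."""
        cursor: Optional[str] = None
        while True:
            payload = {"query": query, "variables": {**variables, "endCursor": cursor}}
            response = self.post(GITHUB_GRAPHQL_URL, self.token, payload)
            connection = response["data"]
            for key in path_to_connection:
                connection = connection[key]
            yield from connection["nodes"]
            page_info = connection["pageInfo"]
            if not page_info["hasNextPage"]:
                break
            cursor = page_info["endCursor"]
```

The query is expected to accept an `$endCursor` variable and pass it as the connection's `after:` argument. Passing the transport in as `post` keeps the pagination loop testable without a real token.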
Changes proposed
- `GitHubGraphqlClient` class that can make paginated calls to the GitHub GraphQL API
- `src/analytics/etl/github/main.py` with the `GitHubGraphqlClient`
- `make-graphql-call.sh` script that previously invoked the `gh` CLI
Context for reviewers
Instructions to test
make build
make sprint-reports-with-latest-data
Notes
We'll want to refactor the `src/analytics/integrations/github/` sub-package a little bit further, pulling most of the code in the `main.py` file in that sub-package into `src/analytics/etl/github.py` instead.
I didn't include that in this PR to try to minimize the amount of code I was changing, but we can/should tackle that refactor in #3203 because some of the functions in `main.py` still write to the local file system, but can easily be updated to pass the exported data as a Python dictionary.
Additional information
The local run of sprint reports with the new code matches the output of the last run triggered by AWS Step Functions (using code in `main`) posted to Slack:
Sprint report for HHS/13
In Slack (based on `main`):
Locally, based on this feature branch:
Sprint burndown for HHS/17
In Slack (based on `main`):
Locally, based on this feature branch:
Deliverable percent complete
In Slack (based on `main`):
Locally, based on this feature branch: