Monitor Critical GitHub Actions Workflows Across Organization Repositories #4941

rishabh6788 · 2024-08-13T17:34:41Z

Is your feature request related to a problem? Please describe

Background

We recently had a situation where the GitHub Actions workflow that publishes snapshots to Maven started failing across all repositories because of an issue on the Sonatype Central side. They had accidentally deleted user tokens during maintenance, and our jobs started failing with 401 errors.
An operator happened to check the failed workflow on a commit they had merged and saw the snapshot workflow failure. Upon further investigation, it was found that the same workflow had been failing across all repositories with the same error for the past 24 hours.
We need to implement a system to monitor critical GitHub Actions workflows across multiple repositories in our organization. This will help us quickly identify and respond to workflow failures or issues.

Describe the solution you'd like

Proposed Solutions

We have identified two broad categories of approaches: pull-based and push-based monitoring.

1. Pull-based Monitoring

Description

a) Onboard GitHub Actions workflow metrics onto the existing metrics framework (Recommended)

Use the existing metrics framework to onboard GitHub Actions metrics
Add a monitor on the failure metric and notify in a Slack channel (already implemented)

b) Use GitHub REST APIs to periodically fetch the GitHub Actions status

Index the collected data in an OpenSearch cluster
Implement a pull job that runs on a cron schedule in Jenkins

Advantages

Centralized monitoring solution
Can provide historical data and trends
Allows for custom alerting based on various criteria

Challenges

May have slight delay in detecting issues due to polling interval
Need to manage API rate limits
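A minimal sketch of what the pull job in option 1b might do, assuming the "List workflow runs for a repository" REST endpoint with its `status`, `created`, and `per_page` query parameters, and the `x-ratelimit-remaining` / `x-ratelimit-reset` response headers. The function names are hypothetical, and the HTTP call and Jenkins scheduling are left out on purpose.

```python
from urllib.parse import urlencode

GITHUB_API = "https://api.github.com"


def failed_runs_url(owner, repo, created_range, per_page=100):
    """Build the URL that lists failed workflow runs for one repository.

    created_range uses GitHub's date-range syntax, e.g. "2024-09-22..2024-09-23".
    """
    query = urlencode(
        {"status": "failure", "created": created_range, "per_page": per_page}
    )
    return f"{GITHUB_API}/repos/{owner}/{repo}/actions/runs?{query}"


def seconds_until_reset(headers, now_epoch):
    """Return how long the poller should wait before the next request.

    headers is a plain dict with lowercase keys (an assumption of this sketch);
    x-ratelimit-reset is a UTC epoch timestamp in seconds.
    """
    remaining = int(headers.get("x-ratelimit-remaining", "1"))
    reset_at = int(headers.get("x-ratelimit-reset", "0"))
    if remaining > 0:
        return 0
    return max(0, reset_at - now_epoch)
```

The polling interval drives the detection delay listed under Challenges, while the rate-limit check keeps a more aggressive schedule from exhausting the API quota.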

2. Push-based Monitoring

Description

a) Slack Notifications Integration in Workflows

Add a Slack action to critical workflows
Configure the action to send a Slack message notification when a job fails

b) Email Notifications

Use GitHub's built-in email notification system or a custom email action
Send detailed email reports for workflow failures

c) Webhook Integration

Set up a custom webhook endpoint in our infrastructure
Configure GitHub to send workflow status updates to this endpoint
Process incoming webhooks to trigger appropriate actions (e.g., update a status page, send notifications)
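As a rough illustration of the webhook option, the sketch below turns a GitHub `workflow_run` webhook delivery into a Slack-style message. It assumes the event name comes from the `X-GitHub-Event` header and that the endpoint forwards the returned `{"text": ...}` body to a Slack incoming webhook; the function name is hypothetical, and signature verification and the HTTP server are omitted.

```python
def failure_alert(event_name, payload):
    """Return a Slack message body for a failed workflow run, else None.

    event_name is the value of the X-GitHub-Event header; payload is the
    parsed JSON body of the webhook delivery.
    """
    if event_name != "workflow_run":
        return None
    # Only completed runs carry a final conclusion.
    if payload.get("action") != "completed":
        return None
    run = payload.get("workflow_run", {})
    if run.get("conclusion") != "failure":
        return None
    repo = payload.get("repository", {}).get("full_name", "unknown")
    return {
        "text": f"Workflow '{run.get('name')}' failed in {repo}: {run.get('html_url')}"
    }
```

Because the filter lives in one endpoint rather than in every workflow file, a webhook configured at the organization level would not require per-repository changes.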

Advantages

Real-time notifications
Simple to set up and maintain
No additional infrastructure required

Challenges

Notification storm/fatigue during multiple failures across all repos
No centralized data storage for historical analysis
Requires updating each workflow file individually

Next Steps

Discuss and decide on the preferred approach (pull-based, push-based, or a combination)
Create a detailed implementation plan for the chosen approach(es)
Assign team members to various tasks
Set up a timeline for implementation and testing
Plan for gradual rollout and monitoring of the new system

Questions to Consider

What defines a "critical" workflow in our organization?
How quickly do we need to be notified of issues?
Do we need historical data for analysis, or are real-time alerts sufficient?
Who should receive notifications, and how should they be prioritized?
How will we handle false positives or transient failures?

Please comment with your thoughts, preferences, or any additional considerations for this monitoring system.

Describe alternatives you've considered

No response

Additional context

No response


rishabh6788 · 2024-08-13T17:35:33Z

Tagging @peterzhuamazon @gaiksaya @getsaurabh02 @prudhvigodithi @dblock for feedback and way forward.

prudhvigodithi · 2024-08-13T17:58:33Z

Thanks @rishabh6788, this is an important enhancement. With the gathered GitHub Actions workflow data we can even have a summary of force-merged pull requests, which is an important metric for OpenSearch repo health. @getsaurabh02 @dblock

I would vote for the 1st option: collect the incremental PR workflows, index the data, and create a monitoring tool on top of the indexed raw data. With option 2, even if we created a custom GitHub action for this purpose, it would be tough to update the hundreds of workflow files across all the repos, and ensuring that the action exists in every new repo is a tedious job.
If we go with solution 1, running the workflow more aggressively to monitor just the incremental PR workflows would reduce the delay in detecting issues.

Thank you

peterzhuamazon · 2024-08-13T18:31:16Z

I am also in favor of pull-based monitoring, with a careful choice of the data sources we want to monitor. However, there will still be gaps, since certain actions only run once a month during the release phase.

We need to figure out a consistent way to dry-run these actions in order to detect issues beforehand.

Thanks.

prudhvigodithi · 2024-08-22T03:58:50Z

Going with option 1 we can do the following:

Today, the metrics code collects the daily incremental PRs (updated, created, merged, closed) across all repositories.
For the list of PRs retrieved, index the head commit. Example: https://api.github.com/repos/opensearch-project/dashboards-observability/pulls/2084

Now, within the same scope or in a separate process, use the check-runs API from GitHub to get the CI runs for the associated commit. Example: https://api.github.com/repos/opensearch-project/query-insights/check-runs/29083082462

Example: https://api.github.com/repos/opensearch-project/query-insights/commits/1f4c4c635d6704e637004e9f363735461db21c2d/check-runs

The check-runs response gives all the information about the CI runs for that commit (coming from a PR); index the relevant information such as name, status, conclusion, etc.
Build the monitoring tool around the indexed data by running a query on the cluster to find the runs with "conclusion": "failure". We can even target specific runs, for example "name": "build-and-publish-snapshots", that have a failure conclusion.
We can even use this information to derive a new metric (force-merged PRs and their trend) to find the PRs that were merged with failing CI checks.
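The query described above could look roughly like the following OpenSearch query DSL body. This is a sketch that assumes `conclusion` and `name` are mapped as keyword fields so that `term` filters match exactly; the function name is hypothetical.

```python
def failed_runs_query(name=None):
    """Build a query body matching indexed runs whose conclusion is "failure".

    When name is given, the query is narrowed to that specific run name.
    Assumes both fields are keyword-mapped (an assumption of this sketch).
    """
    filters = [{"term": {"conclusion": "failure"}}]
    if name is not None:
        filters.append({"term": {"name": name}})
    return {"query": {"bool": {"filter": filters}}}
```

For example, `failed_runs_query("build-and-publish-snapshots")` restricts the search to the snapshot publishing run mentioned above.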

@getsaurabh02 @dblock @rishabh6788 @peterzhuamazon @gaiksaya

prudhvigodithi · 2024-09-06T16:41:51Z

Following is the sample schema that can be indexed to the metrics cluster.

{
  id: <The id of the workflow run and can be directly used as document ID, directly given as part of check-runs API response >
  repository: <The Repo name>
  organization: <Optional: The Repo org>
  number: <PR number for which the workflow has triggered>
  pull_commit: <The head commit of the PR for which the workflow has triggered, should be inferred from pull API>
  merged: <The current state of the PR if merged true/false, should be inferred from pull API>
  commit_id: <The Commit ID of the PR for which the workflow has triggered, this commit should be inferred from pull API>
  html_url: <The html_url of the workflow run, directly given as part of check-runs API response>
  url: <The url of the workflow run, directly given as part of check-runs API response>
  name: <The name of the workflow run, directly given as part of check-runs API response>
  conclusion: <The result of the workflow run, directly given as part of check-runs API response>
  started_at: <The started timestamp of the workflow run, directly given as part of check-runs API response>
  completed_at: <The completed timestamp of the workflow run, directly given as part of check-runs API response>
}
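As a rough illustration, one document matching this schema could be assembled from a check-runs entry and the corresponding pull API response as below. The function name is hypothetical, and filling both `pull_commit` and `commit_id` from the PR head SHA is an assumption of this sketch, since the schema describes both as inferred from the pull API.

```python
def to_document(check_run, pull, repository, organization=None):
    """Combine one check-runs entry with its pull request into an index document.

    check_run is one item of the "check_runs" array; pull is the response of
    the pulls API for the PR that triggered the run.
    """
    head_sha = pull["head"]["sha"]
    return {
        "id": check_run["id"],  # can double as the document ID
        "repository": repository,
        "organization": organization,
        "number": pull["number"],
        "pull_commit": head_sha,
        "merged": pull["merged"],
        "commit_id": head_sha,  # assumption: same head commit as pull_commit
        "html_url": check_run["html_url"],
        "url": check_run["url"],
        "name": check_run["name"],
        "conclusion": check_run["conclusion"],
        "started_at": check_run["started_at"],
        "completed_at": check_run["completed_at"],
    }
```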

Once we have the above information:

We should be able to monitor the desired workflows.
Create visualizations and trend graphs of repos with failing CI workflows, with the ability to filter per repo.
Monitor and create visualizations of repos where PRs are merged without passing CI.
Create issues directly with the PR and workflow run information and URLs.
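One possible way to derive the "merged without passing CI" view from documents shaped like the schema above is sketched here. It works on plain dictionaries already fetched from the cluster; the function name is hypothetical.

```python
from collections import defaultdict


def force_merged_prs(documents):
    """Group failing run names by (repository, PR number) for merged PRs.

    A PR shows up in the result only if it is merged and at least one of its
    indexed runs has the conclusion "failure".
    """
    failing = defaultdict(list)
    for doc in documents:
        if doc.get("merged") and doc.get("conclusion") == "failure":
            failing[(doc["repository"], doc["number"])].append(doc["name"])
    return dict(failing)
```

The resulting mapping carries enough context (repo, PR number, failing run names) to feed either a visualization or an automatically created issue.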

Thank you
@rishabh6788 @getsaurabh02

prudhvigodithi · 2024-09-24T17:50:46Z

Did a deeper dive on the possible repo workflows.

To check all the possible action runs at the repo level (part of the .github/workflows), example
https://api.github.com/repos/opensearch-project/opensearch-build/actions/runs?per_page=100&created=2024-09-22..2024-09-23. This should give all the action workflows triggered by all possible events https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows.
However, the above API does not show the app-based runs, which are of type check-runs (runs like Mend and DCO). To see the status of and monitor these runs, we should get the head_commit and use the API https://api.github.com/repos/opensearch-project/opensearch-build/commits/51b8b104ee98251aa8d38c24c2b9791a9206c5df/check-runs to see the status of the app-based runs.
Here is a small scenario: for one event in the repo, the DCO check failed https://github.com/opensearch-project/opensearch-build/runs/30403041967, but the DCO failure is not recorded in actions/runs https://api.github.com/repos/opensearch-project/opensearch-build/actions/runs?per_page=100&created=2024-08-22..2024-09-23&head_sha=51b8b104ee98251aa8d38c24c2b9791a9206c5df since DCO is not part of .github/workflows. For this we should use https://api.github.com/repos/opensearch-project/opensearch-build/commits/51b8b104ee98251aa8d38c24c2b9791a9206c5df/check-runs.
Coming from this comment, #4941 (comment): if we only monitor the workflows that are part of a PR, we will miss workflows in the repo that are not always triggered by a PR (and the PR events). So we should use https://docs.github.com/en/rest/actions/workflow-runs?apiVersion=2022-11-28 and, at the same time, use the check-runs API based on the head commit for app-based check-runs.
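To make the two-source approach concrete, here is a minimal sketch that merges failures from an actions/runs response and a commit check-runs response. It assumes the responses are already parsed JSON, and that check runs created by GitHub Actions report an `app.slug` of `github-actions`, so they can be skipped to avoid double counting; the function name is hypothetical.

```python
def collect_failures(workflow_runs_response, check_runs_response):
    """Return failures from workflow runs plus app-based check runs.

    workflow_runs_response: parsed body of .../actions/runs
    check_runs_response: parsed body of .../commits/{sha}/check-runs
    """
    failures = []
    for run in workflow_runs_response.get("workflow_runs", []):
        if run.get("conclusion") == "failure":
            failures.append(
                {"source": "actions/runs", "name": run.get("name"), "head_sha": run.get("head_sha")}
            )
    for check in check_runs_response.get("check_runs", []):
        slug = (check.get("app") or {}).get("slug")
        # Assumption: Actions-created check runs are already covered above.
        if slug == "github-actions":
            continue
        if check.get("conclusion") == "failure":
            failures.append(
                {"source": "check-runs", "name": check.get("name"), "head_sha": check.get("head_sha")}
            )
    return failures
```

With this split, a failing DCO check would surface through the check-runs branch even though it never appears in actions/runs.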

peterzhuamazon · 2024-09-24T23:16:11Z

Synced up with Prudhvi today and confirmed that the automation app is able to grab all the necessary context for the requirements.

We will see if we can combine the automation app and metrics cluster together on this.

Thanks.

prudhvigodithi · 2024-10-08T16:21:29Z

Here are the final flow details, implemented through the merged pull requests linked to this issue.

graph LR
    A[GitHub Workflow Events] --> B[GitHub Automation App]
    B --> C[Failure Detection]
    C --> D[Workflow Failure Identified]
    D --> E[CloudWatch Alarms Update]
    D --> F[Failures Indexed]
    E --> I{Alarm Triggered?}
    I -- Yes --> G[Alerts Sent to Teams]
    I -- No --> J[No Action]
    F --> H[Data for Debugging and Trend Analysis]
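The failure branch of the flow above can be sketched as below. This is not the automation app's actual code: the metric namespace, metric name, dimension names, and function names are all hypothetical. The first dictionary is shaped like the keyword arguments of the CloudWatch `put_metric_data` call, and the second is the record that would be indexed for debugging and trend analysis.

```python
def failure_metric(repository, workflow_name):
    """Build keyword arguments for a CloudWatch put_metric_data call.

    Namespace, metric name, and dimensions are hypothetical placeholders.
    """
    return {
        "Namespace": "GitHubActions/WorkflowFailures",
        "MetricData": [
            {
                "MetricName": "WorkflowFailed",
                "Dimensions": [
                    {"Name": "Repository", "Value": repository},
                    {"Name": "Workflow", "Value": workflow_name},
                ],
                "Value": 1.0,
                "Unit": "Count",
            }
        ],
    }


def handle_workflow_event(payload):
    """Return (metric_kwargs, index_document) for a failed run, else None."""
    run = payload.get("workflow_run", {})
    if payload.get("action") != "completed" or run.get("conclusion") != "failure":
        return None
    repo = payload.get("repository", {}).get("name", "unknown")
    document = {
        "repository": repo,
        "name": run.get("name"),
        "conclusion": run.get("conclusion"),
        "html_url": run.get("html_url"),
    }
    return failure_metric(repo, run.get("name")), document
```

Whether an alert is actually sent would then depend on the alarm configured on that metric, which matches the "Alarm Triggered?" decision in the graph.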

prudhvigodithi · 2024-10-08T16:51:57Z

Closing this issue.
@rishabh6788 @getsaurabh02

rishabh6788 added the enhancement and untriaged labels Aug 13, 2024

github-project-automation bot added this to Engineering Effectiveness Board Aug 13, 2024

github-project-automation bot moved this to 🆕 New in Engineering Effectiveness Board Aug 13, 2024

gaiksaya removed the untriaged label Aug 15, 2024

peterzhuamazon assigned rishabh6788 and prudhvigodithi Sep 23, 2024

peterzhuamazon moved this from Planned (Next Quarter) to 🏗 In progress in Engineering Effectiveness Board Sep 23, 2024

prudhvigodithi closed this as completed Oct 8, 2024

github-project-automation bot moved this from 🏗 In progress to ✅ Done in Engineering Effectiveness Board Oct 8, 2024
