Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store gitlab failure reason/exit code in analytics DB #1042

Merged
merged 6 commits into from
Jan 30, 2025

Conversation

mvandenburgh
Copy link
Member

@mvandenburgh mvandenburgh commented Jan 22, 2025

This PR updates the webhook-handler to store the build_failure_reason reported by gitlab (for example, script_failure) in the job data dimension table. It also attempts to parse out the exit code from the job log and store it if found.

A couple things to note-

  1. From what I can tell, gitlab does not store the job exit code anywhere. So, I opted to parse it out of the job log with a regex. If there's a better way/it is in fact stored somewhere in the gitlab DB, I'd be happy to change this. Updated PR to use DB instead

  2. I suspect that only script-failure jobs have an exit code, so we should be able to add a constraint that enforces exit code to be non-null for those jobs. I don't have any evidence for this yet, so I figured we can leave it as null-able for now and revisit it later when we have some data.

  3. I've added a temporary management command to backfill the new fields; I didn't want to make it a migration, because long-running migrations don't work great with our current process for running them.

@zackgalbreath
Copy link
Collaborator

  1. From what I can tell, gitlab does not store the job exit code anywhere

I believe this is stored in the GitLab database as ci_builds_metadata.exit_code

@mvandenburgh
Copy link
Member Author

  1. From what I can tell, gitlab does not store the job exit code anywhere

I believe this is stored in the GitLab database as ci_builds_metadata.exit_code

I made a query in metabase that grabs all the build_failure_reasons from the webhook logs table, and joins it to the ci_builds_metadata table to check this https://metabase.spack.io/question/113-gitlab-db-script-failure-jobs-w-exit-codes-joined (note - that query takes ~20 seconds to run), and it looks like that column is NULL most of the time. And if you remove the AND ("public"."ci_builds_metadata"."exit_code" IS NOT NULL) line from that query, you'll see all the script_failure jobs don't have exit_codes. 🤷‍♂️

@mvandenburgh mvandenburgh marked this pull request as draft January 23, 2025 21:30
@mvandenburgh mvandenburgh removed the request for review from jjnesbitt January 23, 2025 21:30
@mvandenburgh
Copy link
Member Author

  1. From what I can tell, gitlab does not store the job exit code anywhere

I believe this is stored in the GitLab database as ci_builds_metadata.exit_code

I made a query in metabase that grabs all the build_failure_reasons from the webhook logs table, and joins it to the ci_builds_metadata table to check this metabase.spack.io/question/113-gitlab-db-script-failure-jobs-w-exit-codes-joined (note - that query takes ~20 seconds to run), and it looks like that column is NULL most of the time. And if you remove the AND ("public"."ci_builds_metadata"."exit_code" IS NOT NULL) line from that query, you'll see all the script_failure jobs don't have exit_codes. 🤷‍♂️

This might be incorrect. Marking this as draft until I can confirm

@mvandenburgh mvandenburgh force-pushed the add-failure-reason-exit-code-fields branch from 2af4b29 to ea78016 Compare January 24, 2025 01:04
@mvandenburgh mvandenburgh force-pushed the add-failure-reason-exit-code-fields branch from 13f3281 to 7ab99e7 Compare January 24, 2025 01:22
@mvandenburgh
Copy link
Member Author

There must be something going on with the regex in that metabase query, because it does look like at least script_failure jobs always have a non-null exit code. I updated the PR to use that instead.

@mvandenburgh mvandenburgh marked this pull request as ready for review January 24, 2025 15:43
@mvandenburgh
Copy link
Member Author

For posterity - @jjnesbitt and I discussed this offline and decided to move the backfilling of exit codes/failure reasons to a Django migration instead of a management command.

@mvandenburgh mvandenburgh force-pushed the add-failure-reason-exit-code-fields branch from 6ce55b1 to 4e3b23f Compare January 30, 2025 18:56
Copy link
Collaborator

@jjnesbitt jjnesbitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just had one remaining question, but otherwise LGTM. We'll just need to make sure to disable flux before we merge this, so we can manually run migrate.

@mvandenburgh mvandenburgh merged commit 1e5f6d0 into main Jan 30, 2025
17 checks passed
@mvandenburgh mvandenburgh deleted the add-failure-reason-exit-code-fields branch January 30, 2025 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants