fix: publish Glue job failures to SNS #24

Merged: 1 commit into main from fix/glue-job-eventbridge-error on Nov 14, 2024


patheard
Member

@patheard patheard commented Nov 14, 2024

Summary

Update Glue job failure event triggers to publish directly to SNS using the CloudWatch alarm message payload structure. This allows us to publish to the existing SRE Bot webhook without the need for a custom Lambda.
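
To illustrate the payload this change produces, here is a minimal Python sketch that simulates the input transformer locally. The input paths and template fields are copied from the Terraform plan in this PR. The helper names (`resolve_path`, `render_alarm_message`, `build_sns_input`) and the sample event values are hypothetical, and real EventBridge substitution has its own escaping rules that this sketch does not try to reproduce.

```python
import json
import re

# Input paths copied from the plan output (placeholder name -> JSONPath).
INPUT_PATHS = {
    "jobName": "$.detail.jobName",
    "message": "$.detail.message",
    "state": "$.detail.state",
}

# Template fields copied from the plan output; <name> marks a placeholder.
ALARM_TEMPLATE = {
    "AWSAccountId": "739275439843",
    "AlarmArn": "arn:aws:cloudwatch:ca-central-1:739275439843:alarm:glue-job-failure",
    "AlarmDescription": "`<state>` detected for Glue job `<jobName>`",
    "AlarmName": "glue-job-failure",
    "NewStateReason": "<message>",
    "NewStateValue": "ALARM",
    "OldStateValue": "OK",
}

# Hypothetical Glue job state change event, trimmed to the fields used here.
SAMPLE_EVENT = {
    "source": "aws.glue",
    "detail": {
        "jobName": "example-etl",
        "state": "FAILED",
        "message": "Command failed with exit code 1",
    },
}


def resolve_path(event, path):
    """Resolve a simple dotted path such as $.detail.jobName."""
    value = event
    for key in path.split(".")[1:]:  # skip the leading "$"
        value = value[key]
    return value


def render_alarm_message(event):
    """Fill the <placeholders> in the alarm template from the event."""
    values = {name: str(resolve_path(event, path)) for name, path in INPUT_PATHS.items()}

    def fill(text):
        # Unknown placeholders are left untouched.
        return re.sub(r"<(\w+)>", lambda m: values.get(m.group(1), m.group(0)), text)

    return {key: fill(text) for key, text in ALARM_TEMPLATE.items()}


def build_sns_input(event):
    """Wrap the alarm payload as a JSON string under "Message", as in the plan."""
    return json.dumps({"Message": json.dumps(render_alarm_message(event))})


if __name__ == "__main__":
    print(build_sns_input(SAMPLE_EVENT))
```

Because the rendered payload uses the same keys as a CloudWatch alarm notification (`AlarmName`, `NewStateValue`, `NewStateReason`, and so on), a consumer that already parses alarm messages should be able to handle it without special-casing Glue events.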

Related

@patheard patheard self-assigned this Nov 14, 2024

Production: alarms 🚨

✅   Terraform Init: success
✅   Terraform Validate: success
✅   Terraform Format: success
✅   Terraform Plan: success
✅   Conftest: success

⚠️   Warning: resources will be destroyed by this change!

Plan: 1 to add, 0 to change, 1 to destroy
Summary
CHANGE  NAME
delete  aws_cloudwatch_metric_alarm.glue_job_failures
add     aws_cloudwatch_event_target.glue_job_failure
Plan
Resource actions are indicated with the following symbols:
  + create
  - destroy

Terraform will perform the following actions:

  # aws_cloudwatch_event_target.glue_job_failure will be created
  + resource "aws_cloudwatch_event_target" "glue_job_failure" {
      + arn            = "arn:aws:sns:ca-central-1:739275439843:cloudwatch-alarm-action"
      + event_bus_name = "default"
      + force_destroy  = false
      + id             = (known after apply)
      + rule           = "glue-job-failures"
      + target_id      = "send-to-sns"

      + input_transformer {
          + input_paths    = {
              + "jobName" = "$.detail.jobName"
              + "message" = "$.detail.message"
              + "state"   = "$.detail.state"
            }
          + input_template = jsonencode(
                {
                  + Message = jsonencode(
                        {
                          + AWSAccountId     = "739275439843"
                          + AlarmArn         = "arn:aws:cloudwatch:ca-central-1:739275439843:alarm:glue-job-failure"
                          + AlarmDescription = "`<state>` detected for Glue job `<jobName>`"
                          + AlarmName        = "glue-job-failure"
                          + NewStateReason   = "<message>"
                          + NewStateValue    = "ALARM"
                          + OldStateValue    = "OK"
                        }
                    )
                }
            )
        }
    }

  # aws_cloudwatch_metric_alarm.glue_job_failures will be destroyed
  # (because aws_cloudwatch_metric_alarm.glue_job_failures is not in configuration)
  - resource "aws_cloudwatch_metric_alarm" "glue_job_failures" {
      - actions_enabled                       = true -> null
      - alarm_actions                         = [
          - "arn:aws:sns:ca-central-1:739275439843:cloudwatch-alarm-action",
        ] -> null
      - alarm_description                     = "Failed Glue jobs in a 1 minute period." -> null
      - alarm_name                            = "glue-job-failures" -> null
      - arn                                   = "arn:aws:cloudwatch:ca-central-1:739275439843:alarm:glue-job-failures" -> null
      - comparison_operator                   = "GreaterThanThreshold" -> null
      - datapoints_to_alarm                   = 0 -> null
      - dimensions                            = {
          - "JobName" = "*"
        } -> null
      - evaluation_periods                    = 1 -> null
      - id                                    = "glue-job-failures" -> null
      - metric_name                           = "glue-job-failure" -> null
      - namespace                             = "data-lake" -> null
      - ok_actions                            = [
          - "arn:aws:sns:ca-central-1:739275439843:cloudwatch-ok-action",
        ] -> null
      - period                                = 60 -> null
      - statistic                             = "Sum" -> null
      - tags_all                              = {
          - "CostCentre" = "PlatformDataLake"
          - "Terraform"  = "true"
        } -> null
      - threshold                             = 0 -> null
      - treat_missing_data                    = "notBreaching" -> null
        # (4 unchanged attributes hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.

─────────────────────────────────────────────────────────────────────────────

Saved the plan to: plan.tfplan

To perform exactly these actions, run the following command to apply:
    terraform apply "plan.tfplan"
Conftest results
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_event_rule.glue_job_failure"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.glue_crawler_error"]
WARN - plan.json - main - Missing Common Tags: ["aws_kms_key.cloudwatch"]
WARN - plan.json - main - Missing Common Tags: ["aws_sns_topic.cloudwatch_alarm_action"]
WARN - plan.json - main - Missing Common Tags: ["aws_sns_topic.cloudwatch_ok_action"]

24 tests, 19 passed, 5 warnings, 0 failures, 0 exceptions

@patheard patheard merged commit 5dde3e9 into main Nov 14, 2024
4 checks passed
@patheard patheard deleted the fix/glue-job-eventbridge-error branch November 14, 2024 16:01