feat: add CloudWatch alarm for failed Glue jobs #23

Merged
patheard merged 3 commits into main from feat/glue-job-failures
Nov 14, 2024

Conversation

@patheard (Member) commented Nov 14, 2024

Summary

Add an EventBridge rule and custom CloudWatch metric to capture failed Glue jobs. A new alarm has also been added to trigger when a failure occurs.

This PR also includes a README explaining the reason for the ETL job JSON exports.
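
To illustrate how the rule and metric fit together, here is a minimal Python sketch of the flow, not the actual EventBridge implementation. The matcher is a simplified stand-in that only handles exact "any of" matching, the helper names `matches` and `build_metric_payload` are hypothetical, and the sample event's job name is made up. The pattern and metric values are taken from the plan in this PR.

```python
import json

# Event pattern as shown in the plan for aws_cloudwatch_event_rule.glue_job_failure.
EVENT_PATTERN = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["FAILED", "TIMEOUT", "ERROR"]},
}


def matches(pattern, event):
    """Simplified matching: nested dicts recurse, lists mean 'any of these values'."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def build_metric_payload(event):
    """Mimic the input transformer: substitute the job name into the metric template."""
    job_name = event["detail"]["jobName"]
    return {
        "Namespace": "data-lake",
        "MetricData": [
            {
                "MetricName": "glue-job-failure",
                "Dimensions": [{"Name": "JobName", "Value": job_name}],
                "Unit": "Count",
                "Value": 1,
            }
        ],
    }


# Hypothetical sample event; "example-etl-job" is an assumed job name.
sample_event = {
    "source": "aws.glue",
    "detail-type": "Glue Job State Change",
    "detail": {"jobName": "example-etl-job", "state": "FAILED"},
}

if matches(EVENT_PATTERN, sample_event):
    print(json.dumps(build_metric_payload(sample_event), indent=2))
```

A job run that ends in a state outside the list (for example `SUCCEEDED`) does not match the pattern, so no metric datapoint is published for it.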

Related

@patheard patheard self-assigned this Nov 14, 2024

Production: alarms 🚨

✅   Terraform Init: success
✅   Terraform Validate: success
✅   Terraform Format: success
✅   Terraform Plan: success
✅   Conftest: success

Plan: 3 to add, 0 to change, 0 to destroy
Show summary
| CHANGE | NAME |
| --- | --- |
| add | aws_cloudwatch_event_rule.glue_job_failure |
| add | aws_cloudwatch_event_target.glue_job_failure |
| add | aws_cloudwatch_metric_alarm.glue_job_failures |
Show plan
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_cloudwatch_event_rule.glue_job_failure will be created
  + resource "aws_cloudwatch_event_rule" "glue_job_failure" {
      + arn            = (known after apply)
      + description    = "Capture Glue job failures and timeouts"
      + event_bus_name = "default"
      + event_pattern  = jsonencode(
            {
              + detail      = {
                  + state = [
                      + "FAILED",
                      + "TIMEOUT",
                      + "ERROR",
                    ]
                }
              + detail-type = [
                  + "Glue Job State Change",
                ]
              + source      = [
                  + "aws.glue",
                ]
            }
        )
      + force_destroy  = false
      + id             = (known after apply)
      + name           = "glue-job-failures"
      + name_prefix    = (known after apply)
      + tags_all       = {
          + "CostCentre" = "PlatformDataLake"
          + "Terraform"  = "true"
        }
    }

  # aws_cloudwatch_event_target.glue_job_failure will be created
  + resource "aws_cloudwatch_event_target" "glue_job_failure" {
      + arn            = "arn:aws:events:ca-central-1:739275439843:api-destination/cloudwatch-metrics"
      + event_bus_name = "default"
      + force_destroy  = false
      + id             = (known after apply)
      + rule           = "glue-job-failures"
      + target_id      = "PublishMetric"

      + input_transformer {
          + input_paths    = {
              + "jobName" = "$.detail.jobName"
              + "state"   = "$.detail.state"
            }
          + input_template = jsonencode(
                {
                  + MetricData = [
                      + {
                          + Dimensions = [
                              + {
                                  + Name  = "JobName"
                                  + Value = "<jobName>"
                                },
                            ]
                          + MetricName = "glue-job-failure"
                          + Unit       = "Count"
                          + Value      = 1
                        },
                    ]
                  + Namespace  = "data-lake"
                }
            )
        }
    }

  # aws_cloudwatch_metric_alarm.glue_job_failures will be created
  + resource "aws_cloudwatch_metric_alarm" "glue_job_failures" {
      + actions_enabled                       = true
      + alarm_actions                         = [
          + "arn:aws:sns:ca-central-1:739275439843:cloudwatch-alarm-action",
        ]
      + alarm_description                     = "Failed Glue jobs in a 1 minute period."
      + alarm_name                            = "glue-job-failures"
      + arn                                   = (known after apply)
      + comparison_operator                   = "GreaterThanThreshold"
      + dimensions                            = {
          + "JobName" = "*"
        }
      + evaluate_low_sample_count_percentiles = (known after apply)
      + evaluation_periods                    = 1
      + id                                    = (known after apply)
      + metric_name                           = "glue-job-failure"
      + namespace                             = "data-lake"
      + ok_actions                            = [
          + "arn:aws:sns:ca-central-1:739275439843:cloudwatch-ok-action",
        ]
      + period                                = 60
      + statistic                             = "Sum"
      + tags_all                              = {
          + "CostCentre" = "PlatformDataLake"
          + "Terraform"  = "true"
        }
      + threshold                             = 0
      + treat_missing_data                    = "notBreaching"
    }

Plan: 3 to add, 0 to change, 0 to destroy.

─────────────────────────────────────────────────────────────────────────────

Saved the plan to: plan.tfplan

To perform exactly these actions, run the following command to apply:
    terraform apply "plan.tfplan"
Show Conftest results
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_event_rule.glue_job_failure"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.glue_crawler_error"]
WARN - plan.json - main - Missing Common Tags: ["aws_cloudwatch_metric_alarm.glue_job_failures"]
WARN - plan.json - main - Missing Common Tags: ["aws_kms_key.cloudwatch"]
WARN - plan.json - main - Missing Common Tags: ["aws_sns_topic.cloudwatch_alarm_action"]
WARN - plan.json - main - Missing Common Tags: ["aws_sns_topic.cloudwatch_ok_action"]

25 tests, 19 passed, 6 warnings, 0 failures, 0 exceptions
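
For readers unfamiliar with the alarm settings in the plan above (`statistic = "Sum"`, `period = 60`, `threshold = 0`, `comparison_operator = "GreaterThanThreshold"`, `treat_missing_data = "notBreaching"`), the following rough Python sketch shows how a single evaluation period would be judged. It is a simplified model under those settings only, not CloudWatch's actual evaluation logic, and `alarm_state` is a hypothetical helper name.

```python
def alarm_state(datapoints, threshold=0):
    """Evaluate one 60-second period with the Sum statistic.

    datapoints: metric values published during the period (may be empty).
    Missing data is modelled as notBreaching, so an empty period is OK.
    """
    if not datapoints:
        # No failures were published in this period: treated as not breaching.
        return "OK"
    # GreaterThanThreshold against the period's Sum.
    return "ALARM" if sum(datapoints) > threshold else "OK"


print(alarm_state([]))      # quiet period with no failure events
print(alarm_state([1]))     # one failed job in the period
print(alarm_state([1, 1]))  # two failed jobs in the same period
```

Since each failure event publishes a value of 1 and the threshold is 0, a single failed job within a period is enough to breach.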

@patheard patheard merged commit 82b02e8 into main Nov 14, 2024
4 checks passed
@patheard patheard deleted the feat/glue-job-failures branch November 14, 2024 14:17