Terraform configuration for managing Datadog
Here is where we house our codified Datadog monitors (a work in progress, but what isn't?) as we move closer to our goal of sensible observability at both the micro and macro levels, as well as horizontally. You will find monitors modularized by:

1. Per-Service Alerts
2. Ops Alerts
3. RabbitMQ Alerts
4. Paging Alerts
Current module breakdown of alerts:
- Cluster Alerts
  - node not ready / node in {{cluster_name}}
  - Node Memory Pressure
- AWS Autoscaling Alerts
  - node not ready / node in {{cluster_name}}
  - Node Memory Pressure
- Scaling Alerts
  - kube api errors/down
  - hpa errors
  - pending pods
  - nodes have increased
  - below desired replicas
  - above desired replicas
- Kube Alerts
  - deploy replica down
  - pod restarting
  - statefulset replica down
  - daemonset pod down
  - multiple pods failing
  - unavailable statefulset replica
  - node status unschedulable
  - k8s imagepullbackoff
  - pending pods
- Service Alerts
  - service errors
  - service container restart
  - service crashloop
  - pod status terminated
  - pod not ready
  - pod recent restarts
  - pod status error
  - oom detected
  - pod crashes
  - network rx (receive) errors
- RabbitMQ Alerts
  - RabbitMQ Queue Status (move back from README)
  - RabbitMQ High Memory Critical
  - RabbitMQ High Queue Count
  - Node Down (includes pod.phase if it's not in a running state)
  - RabbitMQ High Message Count
  - RabbitMQ Disk Usage
  - RabbitMQ Unacknowledged Rate Too High
- RabbitMQ Staging Alerts
  - RabbitMQ Queue Status (move back from README)
  - RabbitMQ High Memory Critical
  - RabbitMQ High Queue Count
  - Node Down (includes pod.phase if it's not in a running state)
  - RabbitMQ High Message Count
  - RabbitMQ Disk Usage
  - RabbitMQ Unacknowledged Rate Too High
- RDS Alerts
  - RDS Replica Lag
  - RDS Swap
  - RDS Free Memory
  - RDS Connections
  - RDS High CPU
  - RDS Disk Queue
- Paging Alerts
  - SLA
  - Slow
Alert Rules and Routing

While we establish alerts and alerting rules, we are also establishing alerting routes. These are currently in flux, so while we fine-tune and calibrate our monitors, the routes are commented out of the alerts managed by this repo. Once an alert is established as fully operational, the appropriate alert routing will be uncommented and applied.
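As an illustrative sketch of this workflow (the monitor name, query, and notification handle below are hypothetical, not monitors from this repo), the routing handle stays out of the monitor message until the alert is calibrated:

```hcl
resource "datadog_monitor" "example_latency" {
  name  = "Example Service p95 Latency High" # hypothetical monitor
  type  = "query_alert"
  query = "avg(last_5m):avg:trace.http.request.duration{service:example} > 2" # illustrative query

  # Routing handle intentionally left out of the message while the
  # monitor is calibrated; once the alert is fully operational, append
  # something like "@pagerduty-example-service" to the message body.
  message = <<-EOM
    p95 latency is above 2s for the example service.
  EOM

  monitor_thresholds {
    critical = 2
  }

  tags = ["managed_by:terraform"]
}
```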
Possible duplicate monitors:
- RabbitMQ `queue_status` -- this is covered by the Node Down alert:
```hcl
resource "datadog_monitor" "queue_status" {
  name    = "Rabbitmq Status Error"
  type    = "query_alert"
  query   = "avg(last_1m):max:kubernetes_state.pod.status_phase{pod_name:idle-narwhal-rabbitmq-0} by {pod_phase,pod_name} + max:kubernetes_state.pod.status_phase{pod_name:idle-narwhal-rabbitmq-1} by {pod_phase,pod_name} + max:kubernetes_state.pod.status_phase{pod_name:idle-narwhal-rabbitmq-2} by {pod_phase,pod_name} < 1"
  message = <<-EOM
  EOM

  monitor_thresholds {
    critical = 0
  }

  require_full_window = false
  notify_no_data      = false
  renotify_interval   = 0
  include_tags        = true

  tags = [
    "rabbitmq",
    "managed_by:terraform",
  ]
}
```
TODO
- Ticketing Backlog
- Organization - create - Chris/Gabe/Brian
- add users (parent/child org) - Chris/Gabe
```hcl
module "datadog_child_organization" {
  source = "/platform/datadog//modules/child_organization"
  # version = "x.x.x" # pinned version

  organization_name                = "test"
  saml_enabled                     = false # Free and Trial organizations cannot enable SAML
  saml_autocreate_users_domains    = []
  saml_autocreate_users_enabled    = false
  saml_idp_initiated_login_enabled = true
  saml_strict_mode_enabled         = false
  private_widget_share             = false
  saml_autocreate_access_role      = "ro"

  context = module.this.context
}
```
- adding roles (and how this will help with auditing) - Brian/Adam - we will have to define roles. I can start with some that make sense; look at what we are using with AWS and mimic that.
```hcl
module "monitor_configs" {
  source  = "/config/yaml"
  version = "0.8.1"

  map_config_local_base_path = path.module
  map_config_paths           = var.monitor_paths

  context = module.this.context
}

module "role_configs" {
  source  = "/config/yaml"
  version = "0.8.1"

  map_config_local_base_path = path.module
  map_config_paths           = var.role_paths

  context = module.this.context
}

locals {
  monitors_write_role_name    = module.datadog_roles.datadog_roles["monitors-write"].name
  monitors_downtime_role_name = module.datadog_roles.datadog_roles["monitors-downtime"].name

  monitors_roles_map = {
    aurora-replica-lag              = [local.monitors_write_role_name, local.monitors_downtime_role_name]
    ec2-failed-status-check         = [local.monitors_write_role_name, local.monitors_downtime_role_name]
    redshift-health-status          = [local.monitors_downtime_role_name]
    k8s-deployment-replica-pod-down = [local.monitors_write_role_name]
  }
}

module "datadog_roles" {
  source = "/platform/datadog//modules/roles"
  # version = "x.x.x"

  datadog_roles = module.role_configs.map_configs

  context = module.this.context
}

module "datadog_monitors" {
  source = "/platform/datadog//modules/monitors"
  # version = "x.x.x"

  datadog_monitors     = module.monitor_configs.map_configs
  alert_tags           = var.alert_tags
  alert_tags_separator = var.alert_tags_separator
  restricted_roles_map = local.monitors_roles_map

  context = module.this.context
}
```
- Look up what it means to "pin" the versions we are using. When consuming a module, `version` is set as an attribute on the module block.
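As a minimal sketch of version pinning (the module source and version number here are illustrative, not values from this repo): a registry module accepts a `version` attribute, and an exact pin means upgrades only happen when someone deliberately edits the constraint.

```hcl
# Hypothetical example of pinning a module version. "version" only
# works for modules fetched from a registry, not local paths.
module "datadog_monitors" {
  source  = "example-org/datadog/aws//modules/monitors" # illustrative registry source
  version = "1.2.3"                                     # exact pin: upgrades are deliberate

  datadog_monitors = module.monitor_configs.map_configs
  context          = module.this.context
}
```

A range constraint such as `version = "~> 1.2"` would instead allow patch/minor updates on the next `terraform init -upgrade`.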