Keep Datadog monitors, dashboards, etc. in version control and avoid chaotic management via the UI.
- Documented, reusable, automated, and searchable configuration
- Changes are PR reviewed and auditable
- Good defaults like no-data / re-notify are preselected
- Reliable cleanup with automated deletion
- create a new private kennel repo for your organization (do not fork this repo)
- use the template folder as a starting point:
  git clone git@github.com:your-org/kennel.git
  git clone git@github.com:grosser/kennel.git seed
  mv seed/template/* kennel/
  cd kennel && git add . && git commit -m 'initial'
- add a basic project and team so others can copy-paste to get started
- set up a travis build for your repo
- uncomment the .travis.yml section for automated github PR feedback and datadog updates on merge
- follow Setup in your repo's Readme.md
- projects/: monitors/dashboards/etc scoped by project
- teams/: team definitions
- parts/: monitors/dashes/etc that are used by multiple projects
- generated/: projects as json, to show current state and proposed changes in PRs
# teams/my_team.rb
module Teams
  class MyTeam < Kennel::Models::Team
    defaults(
      slack: -> { "my-alerts" },
      email: -> { "[email protected]" }
    )
  end
end
- use the datadog monitor UI to create a monitor
- get the id from the url, then run RESOURCE=monitor ID=12345 bundle exec rake kennel:import
- see below
- find or create a project in projects/
- add a monitor to the parts: [ list
# projects/my_project.rb
class MyProject < Kennel::Models::Project
  defaults(
    team: -> { Teams::MyTeam.new }, # use existing team or create new one in teams/
    parts: -> {
      [
        Kennel::Models::Monitor.new(
          self,
          id: -> { 123456 }, # id from datadog url, not necessary when creating a new monitor
          type: -> { "query alert" },
          kennel_id: -> { "load-too-high" }, # make up a unique name
          name: -> { "Foobar Load too high" }, # nice descriptive name that will show up in alerts and emails
          message: -> {
            # Explain what behavior to expect and how to fix the cause. Use #{super()} to add team notifications.
            <<~TEXT
              Foobar will be slow and that could cause Barfoo to go down.
              Add capacity or debug why it is suddenly slow.
              #{super()}
            TEXT
          },
          query: -> { "avg(last_5m):avg:system.load.5{hostgroup:api} by {pod} > #{critical}" }, # replace actual value with #{critical} to keep them in sync
          critical: -> { 20 }
        )
      ]
    }
  )
end
- run bundle exec rake plan: an update to the existing monitor should be shown (not Create / Delete)
- alternatively: run bundle exec rake generate to only update the generated json files
- review changes then git commit
- make a PR ... get reviewed ... merge
- datadog is updated by travis
- go to the datadog dashboard UI and click on New Dashboard to create a dashboard
- get the id from the url, then run RESOURCE=dashboard ID=abc-def-ghi bundle exec rake kennel:import
- see below
- find or create a project in projects/
- add a dashboard to the parts: [ list
class MyProject < Kennel::Models::Project
  defaults(
    team: -> { Teams::MyTeam.new }, # use existing team or create new one in teams/
    parts: -> {
      [
        Kennel::Models::Dashboard.new(
          self,
          id: -> { "abc-def-ghi" }, # id from datadog url, not needed when creating a new dashboard
          title: -> { "My Dashboard" },
          description: -> { "Overview of foobar" },
          template_variables: -> { ["environment"] }, # see https://docs.datadoghq.com/api/?lang=ruby#timeboards
          kennel_id: -> { "overview-dashboard" }, # make up a unique name
          layout_type: -> { "ordered" },
          definitions: -> {
            [ # An array of arrays, each one is a graph in the dashboard, alternatively a hash for finer control
              [
                # title, viz, type, query; edit an existing graph in the UI to see its json definition
                "Graph name", "timeseries", "area", "sum:mystats.foobar{$environment}"
              ],
              [
                # queries can be an Array as well, this will generate multiple requests
                # for a single graph
                "Graph name", "timeseries", "area", ["sum:mystats.foobar{$environment}", "sum:mystats.success{$environment}"],
                # add events too ...
                events: [{q: "tags:foobar,deploy", tags_execution: "and"}]
              ]
            ]
          }
        )
      ]
    }
  )
end
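The array shorthand in `definitions` can be easier to reason about when expanded by hand. The following is only a sketch of the idea: `expand_graph` and the hash layout it returns are hypothetical and are not kennel's actual implementation or Datadog's exact payload. It shows how a single query or an Array of queries could each become one request, with trailing options such as `events` merged in.

```ruby
# Hypothetical helper (not part of kennel): expand the
# [title, viz, type, queries, options] shorthand into a hash.
# The resulting layout is illustrative only.
def expand_graph(title, viz, type, queries, options = {})
  # Array() wraps a single query String so both forms are handled alike
  requests = Array(queries).map { |q| { q: q, type: type } }
  { title: title, definition: { viz: viz, requests: requests }.merge(options) }
end

shorthand = [
  "Graph name", "timeseries", "area",
  ["sum:mystats.foobar{$environment}", "sum:mystats.success{$environment}"],
  events: [{ q: "tags:foobar,deploy", tags_execution: "and" }]
]

graph = expand_graph(*shorthand)
puts graph[:definition][:requests].size # one request per query
```

For anything the shorthand cannot express, passing a hash directly (as the comment in the example above mentions) is likely the simpler route.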
Some validations might be too strict for your use case or just wrong. Please open an issue, and to unblock yourself in the meantime use the validate: -> { false } option.
To link to existing monitors via their kennel_id:
- Screens uptime widgets can use monitor: {id: "foo:bar"}
- Screens alert_graph widgets can use alert_id: "foo:bar"
- rebase on updated master to not undo other changes
- figure out the project name by converting the class name to snake-case
- run PROJECT=foo bundle exec rake kennel:update_datadog to test changes for a single project
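The snake-case conversion in the steps above can be sketched as follows. `snake_case` here is a hypothetical standalone helper written for illustration. Kennel's own conversion may differ in edge cases such as namespaces or acronyms, so treat the result as a starting guess and check it against the plan output.

```ruby
# Hypothetical helper (not kennel's API): convert a class name to snake-case.
def snake_case(class_name)
  class_name
    .gsub("::", "_")                          # assumption: flatten namespaces
    .gsub(/([A-Z]+)([A-Z][a-z])/, '\1_\2')    # split acronym from next word
    .gsub(/([a-z\d])([A-Z])/, '\1_\2')        # split lower/digit from upper
    .downcase
end

puts snake_case("MyProject") # my_project
```

With that name in hand, the single-project run would be `PROJECT=my_project bundle exec rake kennel:update_datadog`.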
Run rake kennel:alerts TAG=service:my-service to see all un-muted alerts for a given datadog monitor tag.
Add to parts/<folder>.
module Monitors
  class LoadTooHigh < Kennel::Models::Monitor
    defaults(
      name: -> { "#{project.name} load too high" },
      message: -> { "Shut it down!" },
      type: -> { "query alert" },
      query: -> { "avg(last_5m):avg:system.load.5{hostgroup:#{project.kennel_id}} by {pod} > #{critical}" }
    )
  end
end
Reuse it in multiple projects.
class Database < Kennel::Models::Project
  defaults(
    team: -> { Kennel::Models::Team.new(slack: -> { 'foo' }, kennel_id: -> { 'foo' }) },
    parts: -> { [Monitors::LoadTooHigh.new(self, critical: -> { 13 })] }
  )
end
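Why a per-project `critical` override flows into a shared `query` may not be obvious. The sketch below imitates the pattern with a tiny stand-in class. `SketchBase` and `SketchLoadTooHigh` are made up for illustration and are not how kennel is actually implemented. The key idea is that each lambda is evaluated lazily on the instance, so `#{critical}` picks up whichever value that instance ends up with.

```ruby
# Stand-in base class, NOT kennel's implementation: only illustrates
# lambda-based defaults that can be overridden per instance.
class SketchBase
  # Each default becomes an instance method, evaluated lazily on the instance
  def self.defaults(options)
    options.each { |name, block| define_method(name, &block) }
  end

  # Overrides become singleton methods that shadow the class-level defaults
  def initialize(overrides = {})
    overrides.each { |name, block| define_singleton_method(name, &block) }
  end
end

class SketchLoadTooHigh < SketchBase
  defaults(
    critical: -> { 20 },
    query: -> { "avg(last_5m):avg:system.load.5{hostgroup:api} by {pod} > #{critical}" }
  )
end

puts SketchLoadTooHigh.new.query                       # threshold from the default
puts SketchLoadTooHigh.new(critical: -> { 13 }).query  # threshold from the override
```

Because the threshold lives in one place, the query text and the alert threshold cannot drift apart, which is the reason the monitor example interpolates `#{critical}` instead of hard-coding the number.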
rake play
cd template
rake kennel:plan
Then make changes to play around. Do not commit the changes, and make sure to revert with a rake kennel:update after deleting everything.
To make changes via the UI, make a new free datadog account and use its credentials instead.
Michael Grosser
[email protected]
License: MIT