Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[consul] add maintenance metric to Consul catalog. #1267

Closed
wants to merge 1 commit into from

Conversation

diogokiss
Copy link
Contributor

@diogokiss diogokiss commented Mar 16, 2018

What does this PR do?

It adds two metrics called consul.catalog.services_maintenance and consul.catalog.nodes_maintenance to Datadog Consul integration catalog.

The expected behavior is explained below.

All nodes in healthy states

  • Consul is green
    screen shot 2018-03-15 at 1 30 21 pm
  • Metrics per node are:
passing: 3
critical: 0
maintenance: 0

screen shot 2018-03-15 at 1 35 34 pm

  • Metrics per service are:
passing: 3
critical: 0
maintenance: 0

screen shot 2018-03-15 at 1 35 23 pm

One (1) node in faulty

  • Consul is failing for the faulty node
    screen shot 2018-03-15 at 1 40 52 pm

  • Metrics per node are:

passing: 2
critical: 1
maintenance: 0

screen shot 2018-03-15 at 1 44 18 pm

  • Metrics per service are:
passing: 2
critical: 1
maintenance: 0

screen shot 2018-03-15 at 1 44 03 pm

The faulty node is put in maintenance

  • Consul is failing for the faulty node and showing it also in maintenance
    screen shot 2018-03-15 at 1 47 03 pm

  • Metrics per node are:

passing: 2
critical: 0
maintenance: 1

screen shot 2018-03-15 at 1 48 25 pm

  • Metrics per service are:
passing: 2
critical: 0
maintenance: 1

screen shot 2018-03-15 at 1 48 18 pm

Motivation

What inspired you to submit this pull request?

Currently, if a node in Consul is set into maintenance state, it is reported
as a node in critical state to Datadog metrics. This makes it confusing to
determine by the metric whether it is a node failing with problems or an planned
intervention.

A user made a PR to Datadog some time ago, but it was not merged due to
Datadog code organization changes and got forgotten (DataDog/dd-agent#2496).
I'm pushing the change forward.

Testing Guidelines

An overview on testing
is available in our contribution guidelines.

Versioning

Additional Notes

This PR also depends on DataDog/dd-agent#3708.
The screenshots and tests were done using dd-agent version 5.8.0.
I've ported the code to the latest version here available, but didn't have the opportunity to actually test it against the latest version. Anyways, I added tests, which I believe to cover the functionality. Please, let me know how to improve it.

diogokiss pushed a commit to diogokiss/dd-agent that referenced this pull request Mar 16, 2018
@diogokiss diogokiss force-pushed the maintenance-metric branch 2 times, most recently from 93f344f to 29f2d71 Compare March 16, 2018 15:19
@diogokiss diogokiss closed this Mar 18, 2018
@diogokiss diogokiss reopened this Mar 18, 2018
@diogokiss diogokiss force-pushed the maintenance-metric branch 2 times, most recently from ff6171d to 32be53d Compare March 19, 2018 08:09
Currently, if a node in Consul is set into maintenance state, it is reported
as a node in critical state to Datadog metrics. This makes it confusing to
determine by the metric whether it is a node failing with problems or an planned
intervention. A user made a PR to Datadog some time ago, but it was not merged due to
Datadog code organization changes and got forgotten (DataDog/dd-agent#2496).
I'm pushing the change forward. I've tested it using dd-agent
version 5.8.0.

This PR also depends on DataDog/dd-agent#3708
@diogokiss diogokiss force-pushed the maintenance-metric branch from 32be53d to d69de86 Compare March 19, 2018 09:54
@diogokiss
Copy link
Contributor Author

diogokiss commented Mar 19, 2018

So, the error being reported is the following.

  File "/home/travis/build/DataDog/integrations-core/consul/datadog_checks/consul/__init__.py", line 1, in <module>
    from . import consul
  File "/home/travis/build/DataDog/integrations-core/consul/datadog_checks/consul/consul.py", line 50, in <module>
    class ConsulCheck(AgentCheck):
  File "/home/travis/build/DataDog/integrations-core/consul/datadog_checks/consul/consul.py", line 66, in ConsulCheck
    'maintenance': AgentCheck.MAINTENANCE
AttributeError: type object 'AgentCheck' has no attribute 'MAINTENANCE'

As this PR has a dependency on DataDog/dd-agent#3708, this error is kinda expected. So, either I hard code the int (4) in the consul.py file or we merge the constant first in the other PR.

What would be your suggestion, @masci ?

@masci
Copy link
Contributor

masci commented Apr 13, 2018

Hi @diogokiss
first of all, thanks for the PR, it's outstanding!

We discussed the opportunity to merge DataDog/dd-agent#3708 but we won't proceed for the following reasons:

  1. the MAINTENANCE status is a bit too specific and this might be the only use case we face
  2. there's some non trivial work to do on the backend to make the new status of any use to the app and provided 1), we don't think it's worth it.

Is there any chance you can use WARNING for this use case?

@masci masci self-assigned this Apr 13, 2018
@stale
Copy link

stale bot commented Jun 25, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. Note that the issue will not be automatically closed, but this notification will remind us to investigate why there's been inactivity. Thank you for participating in the Datadog open source community.

@stale stale bot added the stale label Jun 25, 2018
@masci
Copy link
Contributor

masci commented Aug 31, 2018

Closing for lack of activity

@masci masci closed this Aug 31, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants