Monitoring the monitoring system

Overview

[Architecture diagram]

Context

  • Prometheus and Alertmanager run as services inside docker swarm. Because these services are stateful, each is pinned to a single agent node labeled to run it. These services can go down in the following scenarios:
    • SC1: The agent node on which these services run goes down
    • SC2: The services are removed from docker swarm due to a bug or mistake in the deploy scripts
    • SC3: Complete datacenter failure
  • We need a system that alerts us when the Prometheus or Alertmanager service goes down

Constraints

  • The system monitoring Prometheus and Alertmanager should not run inside docker swarm, so that it tolerates swarm failures
  • Prometheus and Alertmanager should not need to be exposed to the outside world
  • There should be a monitoring system outside the datacenter to handle datacenter failures

Monit setup for internal monitoring

  • As shown in the figure above, the monit service runs on one of the swarm manager nodes (not as a service in the swarm)
  • Monit is configured to probe the Prometheus and Alertmanager HTTP endpoints periodically (every minute); a minimal config sketch follows this list
  • Monit sends alert notifications when Prometheus or Alertmanager has been down for 5 consecutive checks (5 minutes)
  • This setup covers failure scenarios SC1 and SC2 listed above
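A minimal sketch of the Monit control file for this setup. It assumes Prometheus and Alertmanager listen on their default ports (9090 and 9093) on the same node and expose the standard `/-/healthy` endpoints; the file path and mail settings are placeholders, not the actual values used in our deployment:

```
# /etc/monit/conf.d/prometheus-stack  (path is a placeholder)
set daemon 60                         # probe every minute

# placeholder mail settings for alert notifications
set mailserver smtp.example.org
set alert ops@example.org

check host prometheus with address 127.0.0.1
  if failed port 9090 protocol http request "/-/healthy" for 5 cycles then alert

check host alertmanager with address 127.0.0.1
  if failed port 9093 protocol http request "/-/healthy" for 5 cycles then alert
```

The `for 5 cycles` clause is what gives the "down for 5 checks (5 minutes)" behaviour described above, since the daemon cycle length is set to 60 seconds.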

External monitoring setup

Note: This work has not started yet

Alternatives

  • DeadmanSnitch: In this approach, Prometheus and Alertmanager would be made to ping DeadmanSnitch periodically. If DeadmanSnitch stops receiving pings from these services, it raises an alert (a sketch of one way to wire this up follows this list). Drawbacks are
    • Complexity needed to make Prometheus and Alertmanager ping DeadmanSnitch periodically
    • There are only paid plans
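
One common way to implement such a heartbeat (a sketch only; not something specified in this page) is an always-firing "dead man's switch" alert in Prometheus that Alertmanager repeatedly forwards to the snitch URL as a webhook. If either Prometheus or Alertmanager dies, the heartbeat stops and DeadmanSnitch raises the alert. The rule/receiver names and the webhook URL below are placeholders, and the rule file uses the Prometheus 2.x YAML format:

```yaml
# deadmans_switch.rules.yml (placeholder file name)
groups:
  - name: meta-monitoring
    rules:
      - alert: DeadMansSwitch
        expr: vector(1)        # always true, so the alert always fires
        labels:
          severity: none
        annotations:
          summary: Heartbeat alert; if DeadmanSnitch stops receiving it, the monitoring stack is down
```

```yaml
# alertmanager.yml fragment -- the webhook URL is a placeholder for your snitch URL
route:
  routes:
    - match:
        alertname: DeadMansSwitch
      receiver: deadmansnitch
      repeat_interval: 5m      # re-send the heartbeat every 5 minutes
receivers:
  - name: deadmansnitch
    webhook_configs:
      - url: https://example.deadmansnitch.invalid/your-snitch-token
```

The extra rule and routing plumbing, plus keeping the repeat interval in step with the interval the snitch expects, is the complexity referred to in the first drawback above.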