Add healthcheck for the local container (#37)
* feat(health): add a healthchecker

* feat(readiness): add a timeout

* feat(notifier): add a timeout

* feat(cmd): wire notify and readiness timeouts

* refactor(cmd): make config test suite less painful

* feat(cmd): run healthcheck

* feat(helm): run healthcheck

* chore(cmd): improve flags description

* chore(README): update reference
jlevesy authored May 30, 2023
1 parent 24c8c0e commit eb18955
Showing 12 changed files with 620 additions and 210 deletions.
43 changes: 34 additions & 9 deletions README.md
@@ -19,7 +19,11 @@ Prometheus (in agent mode) can be used to push metrics to a remote storage backe

One approach to this problem is to have multiple agents pushing the same set of metrics to the storage backend. This requires running some sort of metrics deduplication on the storage backend side to ensure correctness.

Using `prometheus-elector`, we can instead make sure that only one Prometheus instance has `remote_write` enabled at any point of time and guarantee a reasonable delay (seconds) for another instance to take over when leading instance becomes unavailable. It minimizes (avoids?) data loss and avoids running some expensive deduplication logic on the storage backend side.
Using `prometheus-elector`, we can instead make sure that only one Prometheus instance has `remote_write` enabled at any point in time and guarantee a reasonable delay (seconds) for another instance to take over when the leading instance becomes unavailable. This brings the following advantages:

- It minimizes (avoids?) data loss
- It avoids running some expensive deduplication logic on the storage backend side
- It is also much more efficient in terms of resource usage (RAM and network) because only one replica scrapes and pushes samples

![illustration](./docs/assets/agent-diagram.svg)

@@ -99,6 +103,13 @@ As it is implemented, it relies on a few assumptions:
- The `member_id` of the replica is the `pod` name.
- The `<pod_name>.<service_name>` domain name is resolvable via DNS. This is a property of StatefulSets in Kubernetes, but it requires the cluster to have DNS support enabled.

#### Monitoring the Local Prometheus

prometheus-elector also continuously monitors its local Prometheus instance and adjusts its participation in the leader election to minimize downtime:

- When starting, it waits for the local Prometheus instance to be ready before taking part in the election.
- It automatically leaves the election if the local Prometheus instance is not considered healthy. It then rejoins as soon as the local instance returns to a healthy state.
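
The healthy/unhealthy decision described above can be sketched as a small counter of consecutive probe results. This is only an illustration of how thresholds such as `-healthcheck-success-threshold` and `-healthcheck-failure-threshold` might be applied; the `thresholdTracker` type and its `observe` method are hypothetical names and are not taken from the prometheus-elector code base.

```go
package main

import "fmt"

// thresholdTracker is a hypothetical helper (not part of prometheus-elector)
// that turns a stream of probe results into a healthy/unhealthy state.
type thresholdTracker struct {
	successThreshold int // consecutive successes needed to become healthy
	failureThreshold int // consecutive failures needed to become unhealthy
	successes        int
	failures         int
	healthy          bool
}

// observe records one probe result and reports whether the state flipped.
func (t *thresholdTracker) observe(ok bool) (changed bool) {
	if ok {
		t.successes++
		t.failures = 0
		if !t.healthy && t.successes >= t.successThreshold {
			t.healthy = true
			return true
		}
		return false
	}

	t.failures++
	t.successes = 0
	if t.healthy && t.failures >= t.failureThreshold {
		t.healthy = false
		return true
	}
	return false
}

func main() {
	// Thresholds of 3 mirror the documented flag defaults.
	t := &thresholdTracker{successThreshold: 3, failureThreshold: 3}
	for i, ok := range []bool{true, true, true, false, false, false, true} {
		changed := t.observe(ok)
		fmt.Printf("probe %d ok=%t healthy=%t changed=%t\n", i, ok, t.healthy, changed)
	}
}
```

In a design like this, a state flip to unhealthy would be the moment to leave the election, and a flip back to healthy the moment to rejoin it. Requiring several consecutive results avoids reacting to a single slow or failed probe.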

### Installing Prometheus Elector

You can find [a Helm chart](./helm) in this repository, as well as [values for the HA agent example](./example/k8s/agent-values.yaml).
@@ -117,7 +128,7 @@ If the leader proxy is enabled, all HTTP calls received on the port 9095 are for

```
-api-listen-address string
HTTP listen address to use for the API. (default ":9095")
HTTP listen address for the API. (default ":9095")
-api-proxy-enabled
Turn on leader proxy on the API
-api-proxy-prometheus-local-port uint
@@ -130,34 +141,48 @@ If the leader proxy is enabled, all HTTP calls received on the port 9095 are for
Grace delay to apply when shutting down the API server (default 15s)
-config string
Path of the prometheus-elector configuration
-healthcheck-failure-threshold int
Number of consecutive failures to consider Prometheus unhealthy (default 3)
-healthcheck-http-url string
URL to the Prometheus health endpoint
-healthcheck-period duration
Healthcheck period (default 5s)
-healthcheck-success-threshold int
Number of consecutive successes to consider Prometheus healthy (default 3)
-healthcheck-timeout duration
HTTP timeout for healthchecks (default 2s)
-init
Only init the prometheus config file
-kubeconfig string
Path to a kubeconfig. Only required if out-of-cluster.
-lease-duration duration
Duration of a lease; clients wait the full duration of a lease before trying to take it over (default 15s)
-lease-name string
Name of lease lock
Name of lease resource
-lease-namespace string
Name of lease lock namespace
Name of lease resource namespace
-lease-renew-deadline duration
Maximum duration spent trying to renew the lease (default 10s)
-lease-retry-period duration
Delay between two attempts of taking/renewing the lease (default 2s)
-notify-http-method string
HTTP method to use when sending the reload config request. (default "POST")
HTTP method to use when sending the reload config request (default "POST")
-notify-http-url string
URL to the reload configuration endpoint
-notify-retry-delay duration
How much time to wait between two notify retries. (default 10s)
Delay between two notify retries. (default 10s)
-notify-retry-max-attempts int
How many times to retry notifying prometheus on failure. (default 5)
How many retries for configuration update (default 5)
-notify-timeout duration
HTTP timeout for notify retries. (default 2s)
-output string
Path to write the active prometheus configuration
Path to write the Prometheus configuration
-readiness-http-url string
URL to Prometheus ready endpoint
URL to the Prometheus ready endpoint
-readiness-poll-period duration
Poll period for the Prometheus readiness check (default 5s)
-readiness-timeout duration
HTTP timeout for readiness calls (default 2s)
-runtime-metrics
Export go runtime metrics
```
69 changes: 60 additions & 9 deletions cmd/config.go
@@ -36,10 +36,19 @@ type cliConfig struct {
notifyHTTPMethod string
notifyRetryMaxAttempts int
notifyRetryDelay time.Duration
notifyTimeout time.Duration

// How to wait for prometheus to be ready.
readinessHTTPURL string
readinessPollPeriod time.Duration
readinessTimeout time.Duration

// How to monitor prometheus health.
healthcheckHTTPURL string
healthcheckPeriod time.Duration
healthcheckTimeout time.Duration
healthcheckSuccessThreshold int
healthcheckFailureThreshold int

// API setup
apiListenAddr string
@@ -107,10 +116,38 @@ func (c *cliConfig) validateRuntimeConfig() error {
return errors.New("invalid notify-retry-delay, should be >= 1")
}

if c.notifyTimeout < 1 {
return errors.New("invalid notify-timeout, should be >= 1")
}

if c.readinessPollPeriod < 1 {
return errors.New("invalid readiness-poll-period, should be >= 1")
}

if c.readinessTimeout < 1 {
return errors.New("invalid readiness-timeout, should be >= 1")
}

if c.healthcheckPeriod < 1 {
return errors.New("invalid healthcheck-period, should be >= 1")
}

if c.healthcheckTimeout < 1 {
return errors.New("invalid healthcheck-timeout, should be >= 1")
}

if c.healthcheckSuccessThreshold < 1 {
return errors.New("invalid healthcheck-success-threshold, should be >= 1")
}

if c.healthcheckFailureThreshold < 1 {
return errors.New("invalid healthcheck-failure-threshold, should be >= 1")
}

if c.apiListenAddr == "" {
return errors.New("missing api-listen-address")
}
@@ -137,22 +174,36 @@ func (c *cliConfig) validateRuntimeConfig() error {
}

func (c *cliConfig) setupFlags() {
flag.StringVar(&c.leaseName, "lease-name", "", "Name of lease lock")
flag.StringVar(&c.leaseNamespace, "lease-namespace", "", "Name of lease lock namespace")
flag.BoolVar(&c.init, "init", false, "Only init the prometheus config file")

flag.StringVar(&c.leaseName, "lease-name", "", "Name of lease resource")
flag.StringVar(&c.leaseNamespace, "lease-namespace", "", "Name of lease resource namespace")
flag.DurationVar(&c.leaseDuration, "lease-duration", 15*time.Second, "Duration of a lease; clients wait the full duration of a lease before trying to take it over")
flag.DurationVar(&c.leaseRenewDeadline, "lease-renew-deadline", 10*time.Second, "Maximum duration spent trying to renew the lease")
flag.DurationVar(&c.leaseRetryPeriod, "lease-retry-period", 2*time.Second, "Delay between two attempts of taking/renewing the lease")

flag.StringVar(&c.kubeConfigPath, "kubeconfig", "", "Path to a kubeconfig. Only required if out-of-cluster.")

flag.StringVar(&c.configPath, "config", "", "Path of the prometheus-elector configuration")
flag.StringVar(&c.outputPath, "output", "", "Path to write the active prometheus configuration")
flag.StringVar(&c.readinessHTTPURL, "readiness-http-url", "", "URL to Prometheus ready endpoint")
flag.StringVar(&c.outputPath, "output", "", "Path to write the Prometheus configuration")

flag.StringVar(&c.readinessHTTPURL, "readiness-http-url", "", "URL to the Prometheus ready endpoint")
flag.DurationVar(&c.readinessPollPeriod, "readiness-poll-period", 5*time.Second, "Poll period for the Prometheus readiness check")
flag.DurationVar(&c.readinessTimeout, "readiness-timeout", 2*time.Second, "HTTP timeout for readiness calls")

flag.StringVar(&c.healthcheckHTTPURL, "healthcheck-http-url", "", "URL to the Prometheus health endpoint")
flag.DurationVar(&c.healthcheckPeriod, "healthcheck-period", 5*time.Second, "Healthcheck period")
flag.DurationVar(&c.healthcheckTimeout, "healthcheck-timeout", 2*time.Second, "HTTP timeout for healthchecks")
flag.IntVar(&c.healthcheckSuccessThreshold, "healthcheck-success-threshold", 3, "Number of consecutive successes to consider Prometheus healthy")
flag.IntVar(&c.healthcheckFailureThreshold, "healthcheck-failure-threshold", 3, "Number of consecutive failures to consider Prometheus unhealthy")

flag.StringVar(&c.notifyHTTPURL, "notify-http-url", "", "URL to the reload configuration endpoint")
flag.StringVar(&c.notifyHTTPMethod, "notify-http-method", http.MethodPost, "HTTP method to use when sending the reload config request.")
flag.IntVar(&c.notifyRetryMaxAttempts, "notify-retry-max-attempts", 5, "How many times to retry notifying prometheus on failure.")
flag.DurationVar(&c.notifyRetryDelay, "notify-retry-delay", 10*time.Second, "How much time to wait between two notify retries.")
flag.BoolVar(&c.init, "init", false, "Only init the prometheus config file")
flag.StringVar(&c.apiListenAddr, "api-listen-address", ":9095", "HTTP listen address to use for the API.")
flag.StringVar(&c.notifyHTTPMethod, "notify-http-method", http.MethodPost, "HTTP method to use when sending the reload config request")
flag.IntVar(&c.notifyRetryMaxAttempts, "notify-retry-max-attempts", 5, "How many retries for configuration update")
flag.DurationVar(&c.notifyRetryDelay, "notify-retry-delay", 10*time.Second, "Delay between two notify retries.")
flag.DurationVar(&c.notifyTimeout, "notify-timeout", 2*time.Second, "HTTP timeout for notify retries.")

flag.StringVar(&c.apiListenAddr, "api-listen-address", ":9095", "HTTP listen address for the API.")
flag.DurationVar(&c.apiShutdownGraceDelay, "api-shutdown-grace-delay", 15*time.Second, "Grace delay to apply when shutting down the API server")
flag.BoolVar(&c.apiProxyEnabled, "api-proxy-enabled", false, "Turn on leader proxy on the API")
flag.UintVar(&c.apiProxyPrometheusLocalPort, "api-proxy-prometheus-local-port", 9090, "Listening port of the local prometheus instance")