Add a summary metric for route reload durations
I'm super curious about how long these take, as we're loading around 1M routes
from the database every time the routes reload (over 1M in draft, just under in
live). Al reckons about 20s, based on the logs, but it would be good to know for
sure.

This adds a summary metric, which will allow us to calculate median / 90th /
95th / 99th percentile durations.

I've also added labels to the count / duration metrics so we can tell which ones
are successes and failures. If you don't put success / failure labels on your
duration metrics they can get all mucked up by quick failures and slow
successes, which you can't distinguish between.
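
To make that concrete, here is a minimal sketch with made-up durations (the `observation` type and `meanByLabel` helper are hypothetical, not part of router). It shows how an unlabelled average can land on a number that describes neither the successes nor the failures:

```go
package main

import "fmt"

// observation is a hypothetical record of one reload attempt.
type observation struct {
	success bool
	seconds float64
}

// meanByLabel returns the mean duration over all observations, plus the mean
// for each value of the "success" label.
func meanByLabel(obs []observation) (overall float64, byLabel map[bool]float64) {
	sums := map[bool]float64{}
	counts := map[bool]int{}
	total := 0.0
	for _, o := range obs {
		sums[o.success] += o.seconds
		counts[o.success]++
		total += o.seconds
	}
	byLabel = map[bool]float64{}
	for label, sum := range sums {
		byLabel[label] = sum / float64(counts[label])
	}
	if len(obs) > 0 {
		overall = total / float64(len(obs))
	}
	return overall, byLabel
}

func main() {
	// Invented numbers: two slow successes and two quick failures.
	obs := []observation{{true, 20}, {true, 22}, {false, 0.1}, {false, 0.1}}
	overall, byLabel := meanByLabel(obs)
	fmt.Printf("unlabelled mean: %.2fs\n", overall)
	fmt.Printf("success=true:    %.2fs\n", byLabel[true])
	fmt.Printf("success=false:   %.2fs\n", byLabel[false])
}
```

With these numbers the unlabelled mean is about 10.5s, even though no reload took anywhere near that long.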

Prometheus summaries / histograms[0] are a bit hard to wrap one's head around,
but I think summary is the right choice here. Key factors:

1) With Histograms, you have to specify the timings of the buckets you care
  about up front (and we don't know how long these reloads take, so that's hard)
2) Summaries let you specify which quantiles you want up front, with the
  calculation happening "on the client side" (i.e. inside router, before things go
  to prometheus), which is more expensive at observation time
3) We're not making many observations for this metric, because we only reload routes
  once every few seconds (max), so the cost of calculating the summary on the client
  side should be small.
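
As a rough illustration of point 1, the sketch below counts observations into cumulative buckets the way a histogram does. It uses only the standard library, the durations are invented, and `bucketCounts` is a hypothetical helper rather than anything from the Prometheus client:

```go
package main

import "fmt"

// bucketCounts returns cumulative counts per upper bound, in the style of a
// Prometheus histogram: counts[i] is the number of observations <= bounds[i],
// and the final extra entry is the +Inf bucket (every observation).
func bucketCounts(bounds []float64, observations []float64) []int {
	counts := make([]int, len(bounds)+1)
	for _, v := range observations {
		for i, b := range bounds {
			if v <= b {
				counts[i]++
			}
		}
		counts[len(bounds)]++
	}
	return counts
}

func main() {
	reloads := []float64{18, 19, 20, 21, 25} // made-up durations around 20s

	// Buckets guessed too small: everything falls into +Inf, so no quantile
	// can be estimated from them.
	tooSmall := []float64{0.1, 0.5, 1, 5, 10}
	fmt.Println(bucketCounts(tooSmall, reloads)) // [0 0 0 0 0 5]

	// Buckets that happen to straddle the real durations are more useful,
	// but picking them requires knowing the answer in advance.
	betterGuess := []float64{5, 10, 20, 30, 60}
	fmt.Println(bucketCounts(betterGuess, reloads)) // [0 0 3 5 5 5]
}
```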

The Objectives map sets the quantiles we care about, and an amount of error. In
this case, by setting `0.5: 0.01` I'm saying "bucket things so I get a quantile
that's between 0.49 and 0.51", and by setting `0.99: 0.005` I'm saying "bucket
things so I get a quantile that's between 0.985 and 0.995". They're not exact
for performance reasons.[1]
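
One way to read those error bounds is as a range of ranks. The sketch below (standard library only; `acceptableRanks` is a hypothetical helper, and the sample size of 1000 is an assumption for illustration) prints which ranks each objective in this commit would accept:

```go
package main

import (
	"fmt"
	"math"
)

// acceptableRanks returns the lowest and highest 1-based ranks (out of n
// sorted observations) that an objective of `q: epsilon` allows the reported
// value to come from. The name and signature are illustrative only.
func acceptableRanks(q, epsilon float64, n int) (lo, hi int) {
	lo = int(math.Round((q - epsilon) * float64(n)))
	hi = int(math.Round((q + epsilon) * float64(n)))
	if lo < 1 {
		lo = 1
	}
	if hi > n {
		hi = n
	}
	return lo, hi
}

func main() {
	objectives := []struct{ q, epsilon float64 }{
		{0.5, 0.01}, {0.9, 0.01}, {0.95, 0.01}, {0.99, 0.005},
	}
	for _, o := range objectives {
		lo, hi := acceptableRanks(o.q, o.epsilon, 1000)
		fmt.Printf("q=%v: rank %d to %d of 1000\n", o.q, lo, hi)
	}
}
```

So with 1000 observations, the value reported for the 0.99 quantile could be any observation ranked from 985th to 995th.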

[0] - https://prometheus.io/docs/practices/histograms/
[1] - https://grafana.com/blog/2022/03/01/how-summary-metrics-work-in-prometheus/#limiting-the-error-an-upper-bound-for-delta
richardTowers committed Aug 25, 2023
1 parent 511e043 commit d06c4f5
Showing 2 changed files with 24 additions and 4 deletions.
18 changes: 17 additions & 1 deletion lib/metrics.go
@@ -15,11 +15,26 @@ var (
 		[]string{"host"},
 	)

-	routeReloadCountMetric = prometheus.NewCounter(
+	routeReloadCountMetric = prometheus.NewCounterVec(
 		prometheus.CounterOpts{
 			Name: "router_route_reload_total",
 			Help: "Total number of attempts to reload the routing table",
 		},
+		[]string{"success"},
+	)
+
+	routeReloadDurationMetric = prometheus.NewSummaryVec(
+		prometheus.SummaryOpts{
+			Name: "router_route_reload_duration_seconds",
+			Help: "Summary of route reload durations in seconds",
+			Objectives: map[float64]float64{
+				0.5:  0.01,
+				0.9:  0.01,
+				0.95: 0.01,
+				0.99: 0.005,
+			},
+		},
+		[]string{"success"},
 	)

 	routeReloadErrorCountMetric = prometheus.NewCounter(
@@ -41,6 +56,7 @@ func registerMetrics(r prometheus.Registerer) {
 	r.MustRegister(
 		internalServerErrorCountMetric,
 		routeReloadCountMetric,
+		routeReloadDurationMetric,
 		routeReloadErrorCountMetric,
 		routesCountMetric,
 	)
10 changes: 7 additions & 3 deletions lib/router.go
@@ -5,6 +5,7 @@ import (
 	"net/http"
 	"net/url"
 	"os"
+	"strconv"
 	"sync"
 	"time"

@@ -213,11 +214,11 @@ type mongoDatabase interface {
 // create a new proxy mux, load applications (backends) and routes into it, and
 // then flip the "mux" pointer in the Router.
 func (rt *Router) reloadRoutes(db *mgo.Database, currentOptime bson.MongoTimestamp) {
+	startTime := time.Now()
 	defer func() {
-		// increment this metric regardless of whether the route reload succeeded
-		routeReloadCountMetric.Inc()
-
+		success := true
 		if r := recover(); r != nil {
+			success = false
 			logWarn("router: recovered from panic in reloadRoutes:", r)
 			logInfo("router: original routes have not been modified")
 			errorMessage := fmt.Sprintf("panic: %v", r)
@@ -228,6 +229,9 @@ func (rt *Router) reloadRoutes(db *mgo.Database, currentOptime bson.MongoTimesta
 		} else {
 			rt.mongoReadToOptime = currentOptime
 		}
+		labels := prometheus.Labels{"success": strconv.FormatBool(success)}
+		routeReloadCountMetric.With(labels).Inc()
+		routeReloadDurationMetric.With(labels).Observe(time.Since(startTime).Seconds())
 	}()

 	logInfo("router: reloading routes")