Scheduler Metrics #2674

d80tb7 · 2023-07-13T13:49:59Z

The new Pulsar-backed scheduler should expose a set of Prometheus metrics that shed light on its internal workings. An initial set of metrics would be:

Scheduler cycle time
Number of jobs considered (per queue?)
Number of jobs scheduled (per cluster etc.)
Number of jobs preempted
Number of clusters scheduled
Evaluated fair share of each queue
Delta between fair share and usage of each queue
Did the cycle complete successfully (added 23/08)

Note that because Prometheus samples metrics at scrape time rather than recording every event, we probably want to store some or all of these as histograms rather than gauges.

There is already some prior art for exposing Prometheus metrics in Armada; see, for example, here and here (the latter being the new scheduler exposing which instance is leader). We use the official Prometheus library for this, but we've found it quite difficult because:

It's hard to write unit tests
It is quite fiddly to use (lots of strings and array sizes that need to match up across different places in the code, with panics if they don't)
Quite a lot of boilerplate to write
Everything is asynchronous.
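One possible way to reduce the string and array-size matching is to hide the label list behind a small typed wrapper, so the compiler checks label arity and tests can read values back directly. The following is only a sketch under that assumption: `counterVec` is a plain-Go stand-in rather than the Prometheus client, and `scheduledJobsLabels`, `scheduledJobsCounter` and the `queue`/`cluster` label names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// counterVec is a stand-in for a label-partitioned counter.
type counterVec struct {
	labelNames []string
	values     map[string]float64
}

func newCounterVec(labelNames ...string) *counterVec {
	return &counterVec{labelNames: labelNames, values: map[string]float64{}}
}

func key(labelValues ...string) string { return strings.Join(labelValues, "\x00") }

// Add returns an error, rather than panicking, when the number of label
// values does not match the number of label names.
func (c *counterVec) Add(v float64, labelValues ...string) error {
	if len(labelValues) != len(c.labelNames) {
		return fmt.Errorf("expected %d label values, got %d",
			len(c.labelNames), len(labelValues))
	}
	c.values[key(labelValues...)] += v
	return nil
}

// scheduledJobsLabels names each label as a struct field, so a missing or
// extra label becomes a compile-time error at the call site.
type scheduledJobsLabels struct {
	Queue   string
	Cluster string
}

// scheduledJobsCounter is the only place where label names and their
// order are written down.
type scheduledJobsCounter struct{ vec *counterVec }

func newScheduledJobsCounter() *scheduledJobsCounter {
	return &scheduledJobsCounter{vec: newCounterVec("queue", "cluster")}
}

func (s *scheduledJobsCounter) Add(l scheduledJobsLabels, n float64) error {
	return s.vec.Add(n, l.Queue, l.Cluster)
}

// Value lets a unit test read the counter back without scraping anything.
func (s *scheduledJobsCounter) Value(l scheduledJobsLabels) float64 {
	return s.vec.values[key(l.Queue, l.Cluster)]
}

func main() {
	c := newScheduledJobsCounter()
	labels := scheduledJobsLabels{Queue: "queue-a", Cluster: "cluster-1"}
	if err := c.Add(labels, 2); err != nil {
		fmt.Println("add failed:", err)
		return
	}
	fmt.Println("scheduled jobs:", c.Value(labels))
}
```

With the official Go client, a similar effect on the panic side may be achievable by calling `GetMetricWithLabelValues`, which returns an error on a label-count mismatch, instead of `WithLabelValues`, which panics.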

It might therefore be worth evaluating one of two possible improvements here:

we're idiots and we're using this library incorrectly
there is another library that we can use which is more suited to our use case.

theAntiYeti self-assigned this Jul 13, 2023

Sharpz7 added the type/design (Design / Architecture suggestions) and component/scheduling (Armada Server, Scheduler and Scheduler Ingester) labels Aug 24, 2023

d80tb7 commented Jul 13, 2023 • edited by theAntiYeti