Decision Engine Metrics Developer Documentation

Shreyas Bhat edited this page Dec 23, 2021 · 3 revisions

Introduction

For any production-level application, it is important to collect useful metrics so that operators can, at a glance, ascertain the state of the system. For the Decision Engine (https://github.com/HEPCloud/decisionengine), we decided to use a lightly-wrapped version of the Prometheus metrics framework (https://prometheus.io/) to instrument the application. This document outlines how metrics collection occurs, how to add new metrics to the application, and how these are exposed for metrics collection by an external source (ideally a Prometheus server).

How the metrics collection pipeline works

Metrics in the Decision Engine (DE) are collected in the application, and stored in memory for single process applications, and on disk for multiprocess applications like the DE (see Multiprocess mode for more details). When a request for the metrics is received by the DE server (either XML-RPC or HTTP), the metrics are aggregated and returned in a plaintext format detailed here (https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details). This data can then be read by a Prometheus server (as is done in the Fermilab installation of the Decision Engine), or can be parsed by another application.
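
For a consumer other than a Prometheus server, the plaintext output can be read with a few lines of Python. The following is a minimal sketch, not part of the Decision Engine: parse_samples is a hypothetical helper, the sample text is invented for illustration, and the parser assumes that sample lines carry no trailing timestamp.

```python
# Invented sample of the plaintext exposition format, for illustration only.
EXAMPLE_OUTPUT = """\
# HELP de_example_cycles_total Completed cycles (illustrative)
# TYPE de_example_cycles_total counter
de_example_cycles_total{channel_name="test channel"} 3.0
# HELP de_example_queue_depth Items waiting (illustrative)
# TYPE de_example_queue_depth gauge
de_example_queue_depth 42.0
"""


def parse_samples(text):
    """Map each sample (name plus labels) to its float value.

    Simplification: assumes no timestamp follows the value.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the HELP/TYPE comment lines
        # The value is the last space-separated field; label values
        # may themselves contain spaces, so split from the right.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


samples = parse_samples(EXAMPLE_OUTPUT)
print(samples)
```

For anything beyond a quick check, the prometheus_client.parser module offers a more complete parser for this format.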

Multiprocess mode

Because shared information such as metrics cannot be kept in memory in a multiprocess application, and is instead kept on disk, a few steps must be taken before metrics can be collected and shared.

  1. A directory to store metrics information must be specified and exported as the environment variable PROMETHEUS_MULTIPROC_DIR
  2. PROMETHEUS_MULTIPROC_DIR must be cleared between each startup of the Decision Engine

Running Decision Engine executable directly

If running the Decision Engine executable directly (not through systemd), this can be done by scripting up something like the following:

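# Directory where the per-process metrics files are written
export PROMETHEUS_MULTIPROC_DIR="/tmp/prometheus_metrics/"
# Clear metrics left over from the previous run, then recreate the directory
rm -rf "${PROMETHEUS_MULTIPROC_DIR}"
mkdir -p "${PROMETHEUS_MULTIPROC_DIR}"
decisionengine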

Running Decision Engine via Systemd

If you are running the decisionengine using systemd, the unit file installed by the decisionengine RPM will direct systemd to create a new private temporary directory in /tmp for the service each time it is restarted. By then specifying a value for PROMETHEUS_MULTIPROC_DIR in the service environment file (by default /etc/sysconfig/decisionengine), systemd will create this directory inside the private temporary directory. For example, if you specify that PROMETHEUS_MULTIPROC_DIR should be set to /tmp, the actual path to the metrics directory will be something like

/tmp/systemd-private-477a6a41b1d149a0927ec896b12448a9-decisionengine.service-oJPCqJ/tmp/

When the decisionengine service is then stopped or restarted, this private temporary directory will be deleted.

Multiprocess module import and Prometheus multiprocess values

During the initial development of the metrics for the Decision Engine, a vital point was observed about how Prometheus stores metrics values. In the imports section of the metrics.py file, there are two imports: prometheus_client and MultiProcessCollector (from prometheus_client.multiprocess).

At runtime, by the time of any import of any prometheus_client library, the environment must be set correctly - that is, PROMETHEUS_MULTIPROC_DIR must be set. The prometheus_client library decides if the client is being run in single- or multi-process mode based on whether this is set at import time, not at metrics collection time. If PROMETHEUS_MULTIPROC_DIR is not set before import time, the metrics will not be stored properly, and thus cannot be retrieved properly.
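
The ordering requirement can be sketched in Python for a launcher or test script that prepares the environment itself. This is an illustrative sketch only: prepare_multiproc_dir and the directory name are hypothetical, and the Decision Engine itself relies on the shell or systemd setup described above.

```python
import os
import shutil
import tempfile


def prepare_multiproc_dir(path):
    """Empty (or create) the metrics directory and export its location."""
    if os.path.isdir(path):
        shutil.rmtree(path)  # drop metrics files left by a previous run
    os.makedirs(path)
    # Child processes inherit this variable from os.environ.
    os.environ["PROMETHEUS_MULTIPROC_DIR"] = path
    return path


# Hypothetical location, chosen only for this example.
metrics_dir = prepare_multiproc_dir(
    os.path.join(tempfile.gettempdir(), "prometheus_metrics_example")
)

# Only after this point should prometheus_client, or any module that
# imports it (such as decisionengine.framework.util.metrics), be imported.
print(metrics_dir)
```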

Details of wrapped Prometheus metrics

The Decision Engine metric types (classes, really) are thin wrappers around the Prometheus metrics: they do little more than set applicable defaults upon instantiation, so that Decision Engine developers need not concern themselves with certain details.

The definitions of the metrics are in metrics.py, at src/decisionengine/framework/util/metrics.py. In this module, the various metrics types (classes) are defined, along with any other metrics-related utilities. Modifications to any metrics or metrics-related utilities should be made in this file as much as possible, and then imported from here to the rest of the Decision Engine code.

Different types of metrics and adding them to code

The four metric types are:

  1. Gauge: Record a numeric value that can change to any other numeric value at any time
  2. Counter: Record a numeric value of a quantity that is monotonically increasing
  3. Summary: Record the number of events and the sum of their observed values for a quantity
  4. Histogram: Like a summary, but observations are also counted in configurable buckets, so that the metric can be used to build histograms and estimate quantiles.
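
The difference between the four types is easiest to see from the values they report. The sketch below uses the prometheus_client classes directly rather than the Decision Engine wrappers, and every metric name in it is invented for illustration. A private CollectorRegistry is used so the example does not interfere with other registered metrics.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram, Summary

registry = CollectorRegistry()  # private registry for this sketch

# Hypothetical metrics, named only for this example.
queue_depth = Gauge("example_queue_depth", "Items waiting", registry=registry)
cycles = Counter("example_cycles", "Completed cycles", registry=registry)
payload = Summary("example_payload_bytes", "Payload sizes", registry=registry)
latency = Histogram(
    "example_latency_seconds", "Call latency",
    buckets=(0.1, 1.0), registry=registry,
)

queue_depth.set(5)
queue_depth.set(2)     # a gauge can move in either direction
cycles.inc()
cycles.inc(2)          # a counter can only increase
payload.observe(100)
payload.observe(300)   # a summary tracks the count and the sum
latency.observe(0.05)
latency.observe(0.5)   # a histogram also counts observations per bucket

# Counters are exposed with a _total suffix; bucket counts are cumulative.
print(registry.get_sample_value("example_cycles_total"))
print(registry.get_sample_value("example_latency_seconds_bucket", {"le": "0.1"}))
```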

Adding these to the code is very simple, and involves two steps:

  1. Instantiating the metric: The naming convention used for the metric instance is <What’s being measured>_<METRIC TYPE>. For example, STATUS_TIME_HISTOGRAM for a histogram measuring status time. The instance declaration generally takes two or three arguments:
    1. The Prometheus name of the metric. It should be close to the instance name, but should follow the Prometheus naming convention (roughly description_type of measurement_units) laid out here: https://prometheus.io/docs/practices/naming/
    2. Simple help text explaining what the metric is measuring
    3. (Optional) Labels that can be used to differentiate the same measurement across different instances (for example, to differentiate which channel is being started when measuring channel startup time). For example, a metric instantiation could look like:
START_CHANNEL_HISTOGRAM = Histogram("de_client_start_channel_duration_seconds", "Time to run de-client --start-channel", ["channel_name"])
  2. Instrumenting the code with this metric: This can be done in a variety of ways. A couple of examples will be detailed here, but for more details, please refer to the Prometheus python library documentation here: (https://github.com/prometheus/client_python#instrumenting).
    1. Calling a method on a metric class to record a value directly: One can simply set the value of an instantiated metric, for example, a Gauge MY_GAUGE, anywhere in the code by using the following call: MY_GAUGE.set(42). To add further dimensionality using labels, the call could look like this: MY_GAUGE.labels(my_label_value).set(42).
    2. Calling a method on a metric class as a context manager: For example, if using the Histogram metric, such as our example instance above START_CHANNEL_HISTOGRAM, we can record a time duration to that metric by doing the following:
with START_CHANNEL_HISTOGRAM.labels(channel_name).time():
    do_something_that_starts_channel()
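
Putting the two steps together, a self-contained sketch might look like the following. It uses the prometheus_client classes directly, because the defaults applied by the Decision Engine wrappers are not shown here. The metric names, the start_channel function, and the sleep that stands in for real work are all assumptions made for illustration.

```python
import time

from prometheus_client import CollectorRegistry, Gauge, Histogram, generate_latest

# A private registry keeps this sketch from colliding with metrics
# registered elsewhere in the process.
registry = CollectorRegistry()

# Step 1: instantiate the metrics (hypothetical names).
ACTIVE_CHANNELS_GAUGE = Gauge(
    "de_example_active_channels",
    "Number of channels currently active (illustrative)",
    registry=registry,
)
START_CHANNEL_HISTOGRAM = Histogram(
    "de_example_start_channel_duration_seconds",
    "Time to start a channel (illustrative)",
    ["channel_name"],
    registry=registry,
)


def start_channel(channel_name):
    """Pretend to start a channel while timing the call."""
    # Step 2: instrument the code, here with the context-manager form.
    with START_CHANNEL_HISTOGRAM.labels(channel_name).time():
        time.sleep(0.01)  # stand-in for real startup work
    ACTIVE_CHANNELS_GAUGE.inc()


start_channel("test_channel")

# Render the collected values in the plaintext exposition format.
output = generate_latest(registry).decode("utf-8")
print(output)
```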

Exposition of metrics (XML-RPC)

The metrics are exposed via the XML-RPC server that is the main Decision Engine server with a simple command: de-client --metrics. This will simply run the display_metrics function from the decisionengine.framework.util.metrics module, which prints out the plaintext values of the metrics at call-time.

Exposition of metrics (HTTP)

The metrics that have been collected are exposed via HTTP through the CherryPy (https://docs.cherrypy.dev) webserver that, by default, runs on port 8000. This port is configurable in the Decision Engine configuration file (/etc/decisionengine/decisionengine.jsonnet by default), under the webserver.port entry. You can view the metrics by running the following:

$ curl localhost:8000/metrics

This runs the same method as de-client --metrics would (DecisionEngine.rpc_metrics) and prints the same plaintext metrics output. This is the recommended path for scraping the metrics with a Prometheus server or another metrics framework.

Turning off metrics exposition

To opt out of exposing the metrics via the webserver or XML-RPC calls, pass the --no-webserver flag. The flag is so named because only the exposition is turned off: the metrics are still collected and can be accessed by the operator on the filesystem, but they are not exposed via the webserver or de-client calls.

The --no-webserver flag can be passed either directly or via systemd. To do the latter, add a line to the service environment file (by default /etc/sysconfig/decisionengine) that reads:

DE_OPTS="--no-webserver"

In the sample service environment file that comes with the Decision Engine RPM (located at package/systemd/decisionengine_sysconfig in the source code), this line is already present, but commented out, for convenience.