Cleanup stale metrics from time to time #32
Comments
@flaviostutz If a counter resets and then increases between two scrapes, is this meter still reliable? Micrometer supports metric unregister operations, but I know it is very difficult to maintain records to decide whether a metric is stale or not.
1: If I understood your question, yes, it is reliable; Prometheus was designed to handle this. A “soft reset” has the same behavior as when a server is restarted and its counters are reset. Did I understand your reasoning?
2: We don’t have to track whether each metric is stale or not. If we simply do a “reset” every 48h, for example, everything will be cleaned up and at most we lose the observations since the last scrape (<15-30s). A more elaborate implementation of selectively removing “stale” metric instances (the default libs surely don’t support this) would rely on an LRU cache. That way we could even limit the number and size of the metrics, with much more control over load, but I think this could be left to another discussion. I don’t know if the lib we use supports resets; maybe we will have to extend it.
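The LRU idea mentioned above can be sketched with Java's access-ordered `LinkedHashMap`. Everything here is hypothetical: `MetricLruCache`, `getOrCreate`, and the cap are illustrative names, not part of Micrometer or any existing registry; a real implementation would also have to unregister the evicted meter from the registry.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: cap the number of live metric instances and evict the
// least recently observed one when the cap is exceeded.
public class MetricLruCache<V> {
    private final int maxMetrics;
    private final Map<String, V> cache;

    public MetricLruCache(int maxMetrics) {
        this.maxMetrics = maxMetrics;
        // accessOrder = true makes iteration order "least recently accessed first",
        // which is exactly what LRU eviction needs.
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                // Called after each insertion; returning true drops the eldest
                // (i.e., least recently observed) metric instance.
                return size() > MetricLruCache.this.maxMetrics;
            }
        };
    }

    // Look up a metric by its identity (name + labels), creating it on first use.
    // The lookup itself counts as an access, refreshing the entry's LRU position.
    public V getOrCreate(String key, Function<String, V> factory) {
        return cache.computeIfAbsent(key, factory);
    }

    public int size() {
        return cache.size();
    }
}
```

With a cap of 2, inserting a third metric evicts the least recently observed one, which bounds memory and scrape size without a global reset.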
(Sent by email in reply to Gilliard Macedo's comment of 12 May 2021, 17:11.)
@flaviostutz Today this is already possible using MetricRegistry and executing the clean method. In both libs we can reset all metrics (we already do this in tests). Maybe we can create a scheduler process to execute this clean, set a default interval for it, and add this to both libs.
Great! Instead of using a scheduler thread, which would be hard (and not even recommended) in servlet environments, maybe we can hook into the same observation method: check the last reset timestamp and, if the elapsed time is greater than 48h, for example, call the mentioned clean method!
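A minimal sketch of that lazy, scheduler-free reset, assuming an injected clock and a clean callback. The names here (`LazyResettingRegistry`, `observe`) are illustrative; the real clean method would be the one on the library's MetricRegistry:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical sketch: no background thread. Every observation checks whether
// the reset interval (e.g. 48h) has elapsed and, if so, triggers the clean
// exactly once before recording.
public class LazyResettingRegistry {
    private final long resetIntervalMs;
    private final LongSupplier clock;   // injected so tests can fake time
    private final Runnable clean;       // e.g. MetricRegistry's clean method
    private final AtomicLong lastResetMs;

    public LazyResettingRegistry(long resetIntervalMs, LongSupplier clock, Runnable clean) {
        this.resetIntervalMs = resetIntervalMs;
        this.clock = clock;
        this.clean = clean;
        this.lastResetMs = new AtomicLong(clock.getAsLong());
    }

    public void observe(Runnable recordMetric) {
        long now = clock.getAsLong();
        long last = lastResetMs.get();
        // compareAndSet ensures only one concurrent caller performs the reset
        // per window, which matters under parallel servlet requests.
        if (now - last > resetIntervalMs && lastResetMs.compareAndSet(last, now)) {
            clean.run();
        }
        recordMetric.run();
    }
}
```

In production the clock would be `System::currentTimeMillis` and the interval 48h; the cost per observation is one timestamp read and one comparison.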
(Sent by email in reply to Carlos Eduardo Panarello's comment of 13 May 2021, 08:05.)
After just a single observation of a metric, it will be reported forever, even with its count frozen at "1", "2", etc. for days or months. When those "stale" metrics are scraped, Prometheus compares each one to its previous value in the datastore (which will be the same) and simply discards it. Now imagine you have hundreds of error/info messages, or even thousands of different paths that are no longer used, returned in every /metrics scrape, wasting CPU and network resources until you restart the server. This is happening to us in production.
Proposal
Perform a "soft reset" of all in-memory metrics every 48h in order to reduce stale metrics. All metrics will be erased, and on the next observation each one starts again at "1" (Prometheus is designed to handle this kind of discontinuation/reset in a series).