Cleanup stale metrics from time to time #32
Comments
@flaviostutz If a counter resets and then increases between two scrapes, is this meter still reliable? Micrometer supports metric unregister operations, but I know it is very difficult to maintain records to decide whether a metric is stale or not.
1: If I understood your question, yes, it is reliable; Prometheus was designed to handle this. A “soft reset” has the same behavior as when a server is restarted and its counters are reset. Did I understand your reasoning?
2: We don’t have to track whether each metric is stale or not. If we simply do a “reset” every 48h, for example, everything will be cleaned up and at most we lose the observations since the last scrape (<15-30s). A more elaborate implementation of selectively removing “stale” metric instances (the default libs surely don’t support this) would rely on an LRU cache. That way we could even limit the number and size of the metrics, with much more control over load, but I think this could be left to another discussion. I don’t know if the lib we use supports resets; maybe we will have to extend it.
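The LRU idea mentioned above can be sketched with Java's access-ordered `LinkedHashMap`. Everything here is hypothetical: `MetricLruCache`, `getOrCreate`, and the cap are illustrative names, not part of Micrometer or any existing registry; a real implementation would also have to unregister the evicted meter from the registry.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: cap the number of live metric instances and evict the
// least recently observed one when the cap is exceeded.
public class MetricLruCache<V> {
    private final int maxMetrics;
    private final Map<String, V> cache;

    public MetricLruCache(int maxMetrics) {
        this.maxMetrics = maxMetrics;
        // accessOrder = true makes iteration order "least recently accessed first",
        // which is exactly what LRU eviction needs.
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                // Called after each insertion; returning true drops the eldest
                // (i.e., least recently observed) metric instance.
                return size() > MetricLruCache.this.maxMetrics;
            }
        };
    }

    // Look up a metric by its identity (name + labels), creating it on first use.
    // The lookup itself counts as an access, refreshing the entry's LRU position.
    public V getOrCreate(String key, Function<String, V> factory) {
        return cache.computeIfAbsent(key, factory);
    }

    public int size() {
        return cache.size();
    }
}
```

With a cap of 2, inserting a third metric evicts the least recently observed one, which bounds memory and scrape size without a global reset.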
(Sent by email in reply to Gilliard Macedo's comment of 12 May 2021, 17:11.)
@flaviostutz Today this is already possible using MetricRegistry and executing the clean method. In both libs we can reset all metrics (we already do this in tests). Maybe we can create a scheduler process to execute this clean, set a default interval for it, and add this to both libs.
Great! Instead of using a scheduler thread, which would be hard (and not even recommended) in servlet environments, maybe we can hook into the same observation method: check the last reset timestamp and, if the elapsed time is greater than 48h, for example, call the mentioned clean method!
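A minimal sketch of that lazy, scheduler-free reset, assuming an injected clock and a clean callback. The names here (`LazyResettingRegistry`, `observe`) are illustrative; the real clean method would be the one on the library's MetricRegistry:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical sketch: no background thread. Every observation checks whether
// the reset interval (e.g. 48h) has elapsed and, if so, triggers the clean
// exactly once before recording.
public class LazyResettingRegistry {
    private final long resetIntervalMs;
    private final LongSupplier clock;   // injected so tests can fake time
    private final Runnable clean;       // e.g. MetricRegistry's clean method
    private final AtomicLong lastResetMs;

    public LazyResettingRegistry(long resetIntervalMs, LongSupplier clock, Runnable clean) {
        this.resetIntervalMs = resetIntervalMs;
        this.clock = clock;
        this.clean = clean;
        this.lastResetMs = new AtomicLong(clock.getAsLong());
    }

    public void observe(Runnable recordMetric) {
        long now = clock.getAsLong();
        long last = lastResetMs.get();
        // compareAndSet ensures only one concurrent caller performs the reset
        // per window, which matters under parallel servlet requests.
        if (now - last > resetIntervalMs && lastResetMs.compareAndSet(last, now)) {
            clean.run();
        }
        recordMetric.run();
    }
}
```

In production the clock would be `System::currentTimeMillis` and the interval 48h; the cost per observation is one timestamp read and one comparison.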
(Sent by email in reply to Carlos Eduardo Panarello's comment of 13 May 2021, 08:05.)
After just a single observation of a metric, it will be reported forever, even with its count frozen at "1", "2", etc. for days or months. When those "stale" metrics are scraped, Prometheus compares each one to its previous value in the datastore (which will be the same) and simply discards it. Now imagine you have hundreds of error/info messages, or even thousands of different paths that are no longer used, returned in every /metrics scrape, wasting CPU and network resources until you restart the server. This is happening to us in production.
Proposal
Perform a "soft reset" of all in-memory metrics every 48h in order to reduce stale metrics. All metrics will be erased, and on the next observation each one starts again at "1" (Prometheus is designed to handle this kind of discontinuation/reset in a series).