Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support custom histogram buckets for Prometheus Timer Metrics #17689

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

ashwintumma23
Copy link
Contributor

@ashwintumma23 ashwintumma23 commented Jan 31, 2025

Fixes #17684.

Description

  • Introduces support for adding custom histogram buckets to all timer type metrics in Prometheus Emitter.
  • If the value is not specified in the defaultMetrics.json file, then the default value of [0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 30.0, 60.0, 120.0, 300.0] is used. Default value is picked from the current code itself, where we were hard-coding it.

Renamed the test resources file ...

Release note

  • Custom Histogram buckets can now be configured for timer metrics from Prometheus emitter

Key changed/added classes in this PR
  • DimensionsAndCollector.java
  • Metrics.java
  • MetricsTest.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, otherwise looks good to me.

@@ -360,6 +360,7 @@ hashcode
hashtable
high-QPS
historicals
histogramBuckets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be needed since this word occurs within code blocks in the docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, pushing changes to the following commit.

@@ -1,5 +1,5 @@
{
"query/time" : { "dimensions" : ["dataSource", "type"], "type" : "timer", "conversionFactor": 1000.0, "help": "Seconds taken to complete a query."},
"query/time" : { "dimensions" : ["dataSource", "type"], "type" : "timer", "conversionFactor": 1000.0, "histogramBuckets": [0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 30.0, 60.0, 120.0, 300.0], "help": "Seconds taken to complete a query."},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why override this? Isn't this already the default value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, it is the default value. My rationale behind overriding this value was that users will find an example to work with.
@capistrant, do you have any example metric in mind that we can use to show a sample for override?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashwintumma23 task/run/time is an example of a metric that we found is not very useful with the defaults

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @capistrant ! What do you think the default values can be? I am thinking [1.0, 2.5, 5.0, 7.5, 10.0, 30.0, 60.0, 120.0, 300.0, +Inf] to capture . But again, this is only a sample use case.

On another thought, do we need to specify an example default value here, or let users/ cluster administrators configure it as they see fit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think only administrators will truly know the proper values to put. If we show an example of how to override, I think that this metric is a good choice though as it should make sense to operators. I would personally add 600, 1200, 1800 on to the end of the override, but that is just based off of me and how we have longer running batch jobs that I'd prefer don't all get tossed into inf.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! A batch job's time will be totally dependent on the use case. As we have already documented how to configure the parameter, is it fine if we don't show it via an example in the defaultMetrics.json file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya, I'm ok with us leaving it to the operator to decide if/when they want to take advantage of this change

Sounds good, removing the example from this pull request in the next commit.

@@ -90,6 +90,7 @@ Prometheus metric path is organized using the following schema:
"dimensions" : <dimension list>,
"type" : <timer|counter|gauge>,
"conversionFactor": <conversionFactor>,
"histogramBuckets": <bucket values>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"histogramBuckets": <bucket values>,
"histogramBuckets": <array of bucket values for timer metric>,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, pushing changes to the following commit.

@@ -40,7 +40,7 @@ All the configuration parameters for the Prometheus emitter are under `druid.emi
| `druid.emitter.prometheus.strategy` | The strategy to expose prometheus metrics. <br/>Should be one of `exporter` and `pushgateway`. Default strategy `exporter` would expose metrics for scraping purpose. Peon tasks (short-lived jobs) should use `pushgateway` strategy. | yes | exporter |
| `druid.emitter.prometheus.port` | The port on which to expose the prometheus HTTPServer. Required if using `exporter` strategy. | no | none |
| `druid.emitter.prometheus.namespace` | Optional metric namespace. Must match the regex `[a-zA-Z_:][a-zA-Z0-9_:]*` | no | druid |
| `druid.emitter.prometheus.dimensionMapPath` | JSON file defining the Prometheus metric type, desired dimensions, help text, and conversionFactor for every Druid metric. | no | Default mapping provided. See below. |
| `druid.emitter.prometheus.dimensionMapPath` | JSON file defining the Prometheus metric type, desired dimensions, conversionFactor, histogram buckets and help text for every Druid metric. | no | Default mapping provided. See below. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add a line in the ### Metric mapping section about the default values of the histogram buckets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, pushing the changes in the following commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya, I'm ok with us leaving it to the operator to decide if/when they want to take advantage of this change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! Referencing quote/ comment below.

Copy link
Contributor

@capistrant capistrant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I'm excited to pull it back into my companies fork asap so we can get better metrics on batch ingest times. I'll plan to defer merge until @kfaraz reviews the changes you made in response to his review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

prometheus-emitter (contrib-extension) - Allow custom bucket ranges for Timer type metrics
3 participants