-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support custom histogram buckets for Prometheus Timer Metrics #17689
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments, otherwise looks good to me.
website/.spelling
Outdated
@@ -360,6 +360,7 @@ hashcode | |||
hashtable | |||
high-QPS | |||
historicals | |||
histogramBuckets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be needed since this word occurs within code blocks in the docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, pushing changes to the following commit.
@@ -1,5 +1,5 @@ | |||
{ | |||
"query/time" : { "dimensions" : ["dataSource", "type"], "type" : "timer", "conversionFactor": 1000.0, "help": "Seconds taken to complete a query."}, | |||
"query/time" : { "dimensions" : ["dataSource", "type"], "type" : "timer", "conversionFactor": 1000.0, "histogramBuckets": [0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 30.0, 60.0, 120.0, 300.0], "help": "Seconds taken to complete a query."}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why override this? Isn't this already the default value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, it is the default value. My rationale behind overriding this value was that users will find an example to work with.
@capistrant, do you have any example metric in mind that we can use to show a sample for override?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ashwintumma23 task/run/time
is an example of a metric that we found is not very useful with the defaults
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @capistrant ! What do you think the default values can be? I am thinking [1.0, 2.5, 5.0, 7.5, 10.0, 30.0, 60.0, 120.0, 300.0, +Inf]
to capture . But again, this is only a sample use case.
On another thought, do we need to specify an example default value here, or let users/ cluster administrators configure it as they see fit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think only administrators will truly know the proper values to put. If we show an example of how to override, I think that this metric is a good choice though as it should make sense to operators. I would personally add 600, 1200, 1800
on to the end of the override, but that is just based off of me and how we have longer running batch jobs that I'd prefer don't all get tossed into inf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense! A batch job's time will be totally dependent on the use case. As we have already documented how to configure the parameter, is it fine if we don't show it via an example in the defaultMetrics.json
file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya, I'm ok with us leaving it to the operator to decide if/when they want to take advantage of this change
Sounds good, removing the example from this pull request in the next commit.
@@ -90,6 +90,7 @@ Prometheus metric path is organized using the following schema: | |||
"dimensions" : <dimension list>, | |||
"type" : <timer|counter|gauge>, | |||
"conversionFactor": <conversionFactor>, | |||
"histogramBuckets": <bucket values>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"histogramBuckets": <bucket values>, | |
"histogramBuckets": <array of bucket values for timer metric>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, pushing changes to the following commit.
@@ -40,7 +40,7 @@ All the configuration parameters for the Prometheus emitter are under `druid.emi | |||
| `druid.emitter.prometheus.strategy` | The strategy to expose prometheus metrics. <br/>Should be one of `exporter` and `pushgateway`. Default strategy `exporter` would expose metrics for scraping purpose. Peon tasks (short-lived jobs) should use `pushgateway` strategy. | yes | exporter | | |||
| `druid.emitter.prometheus.port` | The port on which to expose the prometheus HTTPServer. Required if using `exporter` strategy. | no | none | | |||
| `druid.emitter.prometheus.namespace` | Optional metric namespace. Must match the regex `[a-zA-Z_:][a-zA-Z0-9_:]*` | no | druid | | |||
| `druid.emitter.prometheus.dimensionMapPath` | JSON file defining the Prometheus metric type, desired dimensions, help text, and conversionFactor for every Druid metric. | no | Default mapping provided. See below. | | |||
| `druid.emitter.prometheus.dimensionMapPath` | JSON file defining the Prometheus metric type, desired dimensions, conversionFactor, histogram buckets and help text for every Druid metric. | no | Default mapping provided. See below. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also add a line in the ### Metric mapping
section about the default values of the histogram buckets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, pushing the changes in the following commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya, I'm ok with us leaving it to the operator to decide if/when they want to take advantage of this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! Referencing quote/ comment below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I'm excited to pull it back into my companies fork asap so we can get better metrics on batch ingest times. I'll plan to defer merge until @kfaraz reviews the changes you made in response to his review
Fixes #17684.
Description
timer
type metrics in Prometheus Emitter.[0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, 30.0, 60.0, 120.0, 300.0]
is used. Default value is picked from the current code itself, where we were hard-coding it.Renamed the test resources file ...
defaultMetricsTest.json
was introduced in [Prometheus Emitter] Add to code coverage and remove code smell #17362, and contains a metric for an invalid test case, so it would be better if it is named appropriately todefaultInvalidMetricsTest.json
.defaultMetricsTest.json
file to test customhistogramBuckets
valueRelease note
timer
metrics from Prometheus emitterKey changed/added classes in this PR
DimensionsAndCollector.java
Metrics.java
MetricsTest.java
This PR has: