k8s-cloudwatch-adapter ExternalMetric reports incorrect values on EKS. #76
@chankh , I used the same spec above with a direct CloudWatch API call and the data points are correct. On EKS, however, the HPA always shows lower, incorrect values compared to the CloudWatch metrics. Could you please review and help with this issue? |
@chankh , I enabled Debug=true on the pod and noticed that k8s-cloudwatch-adapter picks up a wrong value compared to the CloudWatch data point. Below are details from the logs/screenshot from CloudWatch. Example: at 2021-02-22T23:18:00Z the adapter reports 1027.76, while at the same time CloudWatch shows 1302. "HTTP/1.1 200 OK
|
Hi, thanks for reporting the issue. However, I am no longer with AWS and have lost access to this repo. Please feel free to fork this project. |
In your earlier message you provided the following metric values and pointed out that the first data point is incorrect: 1027.76, 1291.04, 1291.32, 1292.56, 1291.64. Which CloudWatch metric do these values correspond to? Are these the values of the metric computed using metric math expressions, that is, the metric with the ID e1 from the ExternalMetric spec you shared earlier? You also mention that when you used the CloudWatch API, you got the correct values. Did you use the aws cloudwatch get-metric-data API with the exact same metric math expression? |
Hi Vijay,
1027.76 <- incorrect data point value; this (lower) value is not fully aggregated in CloudWatch and is not present in CloudWatch.
1291.04
1291.32
1292.56
1291.64
1) Yes, the metric output above is returned by the expression e1 = m2 / m1 in the spec below:
  - id: m2
    metricStat:
      metric:
        namespace: "AWS/SQS"
        metricName: "NumberOfMessagesSent"
        dimensions:
          - name: QueueName
            value: "internal-device-sqs"
      period: 60
      stat: Sum
      unit: Count
    returnData: false
  - id: m1
    expression: SEARCH('{"AWS/SQS",cluster} MetricName="xxxx-running-pods"', 'Average', 60)
    label: "xxxx-kube-pods-running"
    returnData: false
  - id: e1
    expression: m2 / m1
    label: "avg-number-of-messages-each-pod"
    returnData: true
2) If I run the same expression using the aws cloudwatch get-metric-data API, I don't see the issue above.
Thanks & Regards
Satish
|
Thanks for the clarification. What I was trying to ascertain was whether there was any issue with the metrics aggregation step that is done within the CloudWatch service. That is the impression I got from the meeting today, but it seems that that is not the case.
Say you are running the following command:
aws cloudwatch get-metric-data --cli-input-json file://metric-data-query.json
and the JSON file metric-data-query.json referenced above has the following metric math expression:
[
  {
    "Id": "m1",
    "Expression": "SEARCH('{\"AWS/SQS\",cluster} MetricName=\"xxxx-running-pods\"', 'Average', 60)",
    "Label": "xxxx-kube-pods-running",
    "ReturnData": false,
    "Period": 60
  },
  {
    "Id": "m2",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/SQS",
        "MetricName": "NumberOfMessagesSent",
        "Dimensions": [
          {
            "Name": "QueueName",
            "Value": "internal-device-sqs"
          }
        ]
      },
      "Period": 60,
      "Stat": "Sum"
    },
    "ReturnData": false
  },
  {
    "Id": "e1",
    "Expression": "m2 / m1",
    "Label": "avg-number-of-messages-each-pod",
    "ReturnData": true,
    "Period": 60
  }
]
If the above CLI command gives you the right values, then the conclusion will be that there is nothing wrong with the original metric data in CloudWatch per se, nor anything to do with the way it does the aggregation using the Sum statistic. If the values output by the CloudWatch Metrics Adapter for Kubernetes are not consistent with the results returned in the CLI response, then the issue has to be with the adapter.
The adapter is hard-coded to retrieve the metric data for the trailing 5 minutes. It will fetch the metric values at the top of every minute for the past 5 minutes. Are you seeing a behavior where the metric value that it reports for the most recent minute is always incorrect?
It appears that the SQS metrics published to CloudWatch are of standard resolution, with data having one-minute granularity. You are using NumberOfMessagesSent, which has a caveat attached to it. Read the footnote at the bottom of the table in the documentation.
All that said, my suggestion would be to check whether the adapter works in a predictable manner if you fetch the metric data up to, say, 2 minutes before the current time. That is, if you are fetching data at 10:00 AM, then gather metric values for the data points at 9:54, 9:55, 9:56, 9:57, 9:58. Just a suggestion. Of course, you will need to modify the adapter code to do that. I have pointed out exactly where you have to make the change. So, you may try rebuilding your version locally and checking. Unfortunately, you are not going to get any official support from AWS for issues pertaining to this adapter. |
@vijayansarathy , Thanks for the details. So, to fix this issue, we need to change the CloudWatch adapter code so that it fetches data points that are at least 2 minutes old from CloudWatch? |
@vijayansarathy , The client.go code uses the AWS SDK package. Is there any way to pass a parameter through the AWS SDK to get data points older than a given number of minutes from CloudWatch? As far as I know, the only option is to pass an end time. Is there a better way to pass a dynamic end time of system time minus 2 minutes via the AWS SDK module? |
That is a suggestion. I am curious to see what the computed metric values look like if you drop the most recent minute and go further back from there.
Separately, I wanted to ask you the rationale behind using the metric NumberOfMessagesSent for autoscaling. Given that it is a counter, you have to convert it into a rate/utilization metric, which is what has led you down this path. But how about the metric ApproximateNumberOfMessagesVisible? I have seen customers use this more often for scaling out message consumers/subscribers in a fan-out scenario. If you use this, you will not have to calculate a rate metric like you do now. You will have to use the number of Pods to calculate the average number of messages visible per Pod. Then, you can simply set a target value of, say, 500 and the HPA will scale the number of Pods out until (ApproximateNumberOfMessagesVisible / NPods) is less than 500, as illustrated in the sketch below. Would that not work? |
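For illustration, here is a minimal Go sketch of that scaling arithmetic, using the standard HPA rule desiredReplicas = ceil(currentReplicas * currentValue / targetValue); the numbers and the helper name desiredReplicas are purely illustrative, not part of the adapter.

package main

import (
	"fmt"
	"math"
)

// desiredReplicas mirrors the HPA scaling rule for a Value-type external metric:
// desired = ceil(currentReplicas * currentMetricValue / targetValue).
func desiredReplicas(currentReplicas int, currentValue, targetValue float64) int {
	return int(math.Ceil(float64(currentReplicas) * currentValue / targetValue))
}

func main() {
	// Illustrative numbers only: 4 Pods, 4000 messages visible in the queue.
	pods := 4
	visible := 4000.0
	avgPerPod := visible / float64(pods) // 1000 messages per Pod (the per-Pod metric)
	target := 500.0                      // target average messages per Pod

	fmt.Println(desiredReplicas(pods, avgPerPod, target)) // prints 8: HPA scales out to 8 Pods
}

With 8 Pods the average drops to 500 messages per Pod, so the HPA stops scaling out.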
The GetMetricData API in the AWS SDK for CloudWatch does support specifying StartTime and EndTime; see the sketch below. |
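For reference, a minimal standalone sketch, assuming the AWS SDK for Go v1 (github.com/aws/aws-sdk-go) and reusing the m2 query from the spec above; the adapter's own code may differ, and the 2-minute offset is only the suggestion discussed earlier.

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("eu-west-3")))
	svc := cloudwatch.New(sess)

	// End the window 2 minutes before now so the most recent, possibly
	// still-aggregating data point is excluded, then look back 5 minutes.
	endTime := time.Now().Add(-2 * time.Minute)
	startTime := endTime.Add(-5 * time.Minute)

	out, err := svc.GetMetricData(&cloudwatch.GetMetricDataInput{
		StartTime: aws.Time(startTime),
		EndTime:   aws.Time(endTime),
		MetricDataQueries: []*cloudwatch.MetricDataQuery{
			{
				Id: aws.String("m2"),
				MetricStat: &cloudwatch.MetricStat{
					Metric: &cloudwatch.Metric{
						Namespace:  aws.String("AWS/SQS"),
						MetricName: aws.String("NumberOfMessagesSent"),
						Dimensions: []*cloudwatch.Dimension{
							{Name: aws.String("QueueName"), Value: aws.String("internal-device-sqs")},
						},
					},
					Period: aws.Int64(60),
					Stat:   aws.String("Sum"),
				},
				ReturnData: aws.Bool(true),
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range out.MetricDataResults {
		fmt.Println(aws.StringValue(r.Label), aws.Float64ValueSlice(r.Values))
	}
}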
@vijayansarathy , When we started with HPA autoscaling we did try ApproximateNumberOfMessagesVisible and noticed that, under our heavy load, Pods started flapping a lot within a short period (here we could try the behavior option to control it). Our main goal here is to figure out how we can use the CloudWatch adapter for other types of metrics on EKS clusters. |
@vijayansarathy , As per my understanding, we need to change the code below, for example:
Change the logic around "now :=" in the adapter code above, as shown below.
|
now := time.Now() |
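For illustration, a minimal sketch of that change in Go, assuming the adapter's client.go derives its trailing 5-minute query window from time.Now(); the function name queryWindow and the variable names are illustrative, not the adapter's actual code.

package main

import (
	"fmt"
	"time"
)

// queryWindow returns the start and end of the CloudWatch query window,
// ending "offset" before the current time and spanning the adapter's usual
// 5-minute lookback, so the most recent (possibly incomplete) minute is skipped.
func queryWindow(offset time.Duration) (start, end time.Time) {
	now := time.Now().Add(-offset) // was: now := time.Now()
	return now.Add(-5 * time.Minute), now
}

func main() {
	start, end := queryWindow(2 * time.Minute) // drop the most recent 2 minutes
	fmt.Println(start.Format(time.RFC3339), "->", end.Format(time.RFC3339))
}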
@chankh
We are doing application autoscaling on a QA EKS cluster using SQS/custom metrics, and we use the external AWS CloudWatch adapter plugin on EKS to fetch metrics from CloudWatch. The issue is that, intermittently, the HPA current value is lower than and inconsistent with the CloudWatch data points, and because of this, application autoscaling does not happen as expected. We are fetching CloudWatch metrics from a different account using an IAM role.
Below is the ExternalMetric spec:
spec:
  name: partnerapi-notifications-z1-goldenqa-msg-requests-avg-new
  queries:
    - id: m2
      metricStat:
        metric:
          namespace: "AWS/SQS"
          metricName: "NumberOfMessagesSent"
          dimensions:
            - name: QueueName
              value: "internal-device-sqs"
        period: 60
        stat: Sum
        unit: Count
      returnData: false
    - id: m1
      expression: SEARCH('{"AWS/SQS",cluster} MetricName="xxxx-running-pods"', 'Average', 60)
      label: "xxxx-kube-pods-running"
      returnData: false
    - id: e1
      expression: m2 / m1
      label: "avg-number-of-messages-each-pod"
      returnData: true
  region: eu-west-3
  resource:
    resource: deployment
  roleArn: arn:aws:iam::xxxxxx:role/target-xxxxx
Adapter pod logs:
0221 10:38:32.324000 1 handler.go:67] externalMetricInfo: &{{ExternalMetric metrics.aws/v1alpha1} {partnerapi-notifications-z1-goldenqa-msg-requests-avg-new partner-notifications /apis/metrics.aws/v1alpha1/namespaces/partner-notifications/externalmetrics/partnerapi-notifications-z1-goldenqa-msg-requests-avg-new dfa4fff7-b053-4037-9452-0d907ce4c702 65889829 1 2021-02-21 10:22:21 +0000 UTC map[app.kubernetes.io/instance:golden-qa app.kubernetes.io/managed-by:Helm app.kubernetes.io/name:backend-partner-notifications environment:GOLDENQA helm.sh/chart:backend-partner-notifications-1.0.0] map[kubectl.kubernetes.io/last-applied-configuration:{"apiVersion":"metrics.aws/v1alpha1","kind":"ExternalMetric","metadata":{"annotations":{},"labels":{"app.kubernetes.io/instance":"golden-qa","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"backend-partner-notifications","environment":"GOLDENQA","helm.sh/chart":"backend-partner-notifications-1.0.0"},"name":"partnerapi-notifications-z1-goldenqa-msg-requests-avg-new","namespace":"partner-notifications"},"spec":{"name":"partnerapi-notifications-z1-goldenqa-msg-requests-avg-new","queries":[{"id":"m2","metricStat":{"metric":{"dimensions":[{"name":"QueueName","value":"internal-device-sqs"}],"metricName":"NumberOfMessagesSent","namespace":"AWS/SQS"},"period":60,"stat":"Sum","unit":"Count"},"returnData":false},{"expression":"SEARCH('{"AWS/SQS",cluster} MetricName="partner-notifications-running-pods"','Average', 60)","id":"m1","label":"partnerapi-notifications-kube-pods-running","returnData":false},{"expression":"m2 / m1","id":"e1","label":"avg-number-of-messages-each-pod","returnData":true}],"region":"eu-west-3","resource":{"resource":"deployment"},"roleArn":"arn:aws:iam::xxxxxx:role/target-cloudwatch-role"}}
] [] [] []} {partnerapi-notifications-z1-goldenqa-msg-requests-avg-new 0xc0004e4910 0xc0004e4900 [{ m2 {{[{QueueName internal-device-sqs}] NumberOfMessagesSent AWS/SQS} 60 Sum Count} 0xc00014a28f} {SEARCH('{"AWS/SQS",cluster} MetricName="partner-notifications-running-pods"','Average', 60) m1 partnerapi-notifications-kube-pods-running {{[] } 0 } 0xc00014a2ca} {m2 / m1 e1 avg-number-of-messages-each-pod {{[] } 0 } 0xc00014a30a}]}}