This repository has been archived by the owner on Sep 24, 2021. It is now read-only.

k8s-cloudwatch-adapter ExternalMetric reports incorrect values on EKS. #76

Open
snandhyala opened this issue Feb 22, 2021 · 14 comments

@snandhyala

@chankh

We are doing application autoscaling on a QA EKS cluster using SQS/custom metrics, using the external AWS CloudWatch adapter plugin on EKS to fetch metrics from CloudWatch. The issue is that, intermittently, the HPA's current value is lower than and inconsistent with the CloudWatch data points, so application autoscaling does not happen as expected. We are fetching CloudWatch metrics from a different account using an IAM role.

Below is the ExternalMetric spec:

spec:
  name: partnerapi-notifications-z1-goldenqa-msg-requests-avg-new
  queries:
    - id: m2
      metricStat:
        metric:
          namespace: "AWS/SQS"
          metricName: "NumberOfMessagesSent"
          dimensions:
            - name: QueueName
              value: "internal-device-sqs"
        period: 60
        stat: Sum
        unit: Count
      returnData: false
    - id: m1
      expression: SEARCH('{"AWS/SQS",cluster} MetricName="xxxx-running-pods"', 'Average', 60)
      label: "xxxx-kube-pods-running"
      returnData: false
    - id: e1
      expression: m2 / m1
      label: "avg-number-of-messages-each-pod"
      returnData: true
  region: eu-west-3
  resource:
    resource: deployment
  roleArn: arn:aws:iam::xxxxxx:role/target-xxxxx
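
For context, an HPA consuming an ExternalMetric like this would typically look something like the following sketch (the HPA name, deployment name, replica bounds and target value are illustrative, not the exact values from the cluster); the HPA's "current" value is whatever the adapter reports for the e1 query:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: partnerapi-notifications-hpa      # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-partner-notifications   # illustrative deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metricName: partnerapi-notifications-z1-goldenqa-msg-requests-avg-new
        targetValue: 500                  # illustrative threshold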


Adapter pod logs:

0221 10:38:32.324000 1 handler.go:67] externalMetricInfo: &{{ExternalMetric metrics.aws/v1alpha1} {partnerapi-notifications-z1-goldenqa-msg-requests-avg-new partner-notifications /apis/metrics.aws/v1alpha1/namespaces/partner-notifications/externalmetrics/partnerapi-notifications-z1-goldenqa-msg-requests-avg-new dfa4fff7-b053-4037-9452-0d907ce4c702 65889829 1 2021-02-21 10:22:21 +0000 UTC map[app.kubernetes.io/instance:golden-qa app.kubernetes.io/managed-by:Helm app.kubernetes.io/name:backend-partner-notifications environment:GOLDENQA helm.sh/chart:backend-partner-notifications-1.0.0] map[kubectl.kubernetes.io/last-applied-configuration:{"apiVersion":"metrics.aws/v1alpha1","kind":"ExternalMetric","metadata":{"annotations":{},"labels":{"app.kubernetes.io/instance":"golden-qa","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"backend-partner-notifications","environment":"GOLDENQA","helm.sh/chart":"backend-partner-notifications-1.0.0"},"name":"partnerapi-notifications-z1-goldenqa-msg-requests-avg-new","namespace":"partner-notifications"},"spec":{"name":"partnerapi-notifications-z1-goldenqa-msg-requests-avg-new","queries":[{"id":"m2","metricStat":{"metric":{"dimensions":[{"name":"QueueName","value":"internal-device-sqs"}],"metricName":"NumberOfMessagesSent","namespace":"AWS/SQS"},"period":60,"stat":"Sum","unit":"Count"},"returnData":false},{"expression":"SEARCH('{"AWS/SQS",cluster} MetricName="partner-notifications-running-pods"','Average', 60)","id":"m1","label":"partnerapi-notifications-kube-pods-running","returnData":false},{"expression":"m2 / m1","id":"e1","label":"avg-number-of-messages-each-pod","returnData":true}],"region":"eu-west-3","resource":{"resource":"deployment"},"roleArn":"arn:aws:iam::xxxxxx:role/target-cloudwatch-role"}}
] [] [] []} {partnerapi-notifications-z1-goldenqa-msg-requests-avg-new 0xc0004e4910 0xc0004e4900 [{ m2 {{[{QueueName internal-device-sqs}] NumberOfMessagesSent AWS/SQS} 60 Sum Count} 0xc00014a28f} {SEARCH('{"AWS/SQS",cluster} MetricName="partner-notifications-running-pods"','Average', 60) m1 partnerapi-notifications-kube-pods-running {{[] } 0 } 0xc00014a2ca} {m2 / m1 e1 avg-number-of-messages-each-pod {{[] } 0 } 0xc00014a30a}]}}

@snandhyala
Author

@chankh ,

I used the same spec above in a direct CloudWatch API call and the data points are correct there. On the EKS HPA, however, we always see lower, incorrect values compared to the CloudWatch metrics. Could you please review and help with this issue?

@snandhyala
Author

snandhyala commented Feb 22, 2021

@chankh ,

Enabled Debug=true on the pod and noticed that k8s-cloudwatch-adapter picks up a wrong value compared to the CloudWatch data point. Detailed logs and a CloudWatch screenshot are below.

Example: 2021-02-22T23:18:00Z = 1027.76 --> at the same time, CloudWatch shows a value of 1302.

"HTTP/1.1 200 OK
Content-Length: 1002
Content-Type: text/xml
Date: Mon, 22 Feb 2021 23:19:18 GMT
X-Amzn-Requestid: 6398f40e-15fc-48de-b86b-4d62ea6e99a4

2021/02/22 23:19:19




2021-02-22T23:18:00Z
2021-02-22T23:17:00Z
2021-02-22T23:16:00Z
2021-02-22T23:15:00Z
2021-02-22T23:14:00Z


1027.76
1291.04
1291.32
1292.56
1291.64

avg-number-of-messages-each-pod partner-notifications-running-pods
e1
Complete





6398f40e-15fc-48de-b86b-4d62ea6e99a4

"

@snandhyala
Author

snandhyala commented Feb 22, 2021

adapater-debuug.log

@chankh
Contributor

chankh commented Feb 24, 2021

Hi, thanks for reporting the issue. However, I am no longer with AWS and have lost access to this repo. Please feel free to fork this project.

@vijayansarathy

@snandhyala

In your earlier message, you provided the following metric values and pointed out that the first data point is incorrect:

1027.76
1291.04
1291.32
1292.56
1291.64

Which CloudWatch metric do these values correspond to? Are these the values of the metric computed using metric math expressions, that is, the metric with the ID e1 from the ExternalMetric spec you shared earlier?

You also mention that when you used the CloudWatch API directly, you got the correct values.
Did you use the aws cloudwatch get-metric-data API with the exact same metric math expression?

@snandhyala
Author

snandhyala commented Mar 30, 2021 via email

@vijayansarathy

vijayansarathy commented Mar 31, 2021

@snandhyala

Thanks for the clarification.

What I was trying to ascertain was whether there was any issue with the metrics aggregation step that is done within the CloudWatch service. That is the impression I got from the meeting today but it seems that that is not the case.

Say, you are running the following command:

aws cloudwatch get-metric-data --cli-input-json file://metric-data-query.json

and the JSON file metric-data-query.json referenced above has the following metric math expression:

[
    {
        "Id": "m1",
        "Expression": "SEARCH('{\"AWS/SQS\",cluster} MetricName=\"xxxx-running-pods\"', 'Average', 60)",
        "Label": "xxxx-kube-pods-running",
        "ReturnData": false,
        "Period": 60
    },
    {
        "Id": "m2",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/SQS",
                "MetricName": "NumberOfMessagesSent",
                "Dimensions": [
                    {
                        "Name": "QueueName",
                        "Value": "internal-device-sqs"
                    }
                ]
            },
            "Period": 60,
            "Stat": "Sum"
        },
        "ReturnData": false
    },
    {
        "Id": "e1",
        "Expression": "m2 / m1",
        "Label": "avg-number-of-messages-each-pod",
        "ReturnData": true,
        "Period": 60
    }
]
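
As an aside, the query window can also be given explicitly on the command line, supplying the array above via --metric-data-queries rather than --cli-input-json; a sketch, with illustrative timestamps:

aws cloudwatch get-metric-data \
    --metric-data-queries file://metric-data-query.json \
    --start-time 2021-02-22T23:12:00Z \
    --end-time 2021-02-22T23:17:00Z \
    --region eu-west-3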

If the above CLI command gives you the right values, then the conclusion will be that there is nothing wrong with the original metric data in CloudWatch per se, nor with the way it does the aggregation using the SUM operator.

If the values output by the CloudWatch Metrics Adapter for Kubernetes are not consistent with the results returned in the CLI response, then the issue has to be with the adapter. The adapter is hard-coded to retrieve the metric data for the trailing 5 minutes: it fetches the metric values at the top of every minute for the past 5 minutes. Are you seeing a behavior where the metric value it reports for the most recent minute is always incorrect? It appears that the SQS metrics published to CloudWatch are of standard resolution, with data having a one-minute granularity. You are using NumberOfMessagesSent, which has a caveat attached to it; read the footnote at the bottom of the table in the documentation.

All that said, my suggestion would be to check whether the adapter behaves predictably if you fetch the metric data up to, say, 2 minutes before the current time. That is, if you are fetching data at 10:00 AM, gather metric values for the data points at 9:54, 9:55, 9:56, 9:57 and 9:58. Just a suggestion. Of course, you will need to modify the adapter code to do that. I have pointed out exactly where you have to make the change, so you may try rebuilding your version locally and checking. Unfortunately, you are not going to get any official support from AWS for issues pertaining to this adapter.

@devoprock

@vijayansarathy, thanks for the details. So to fix this issue we need to change the CloudWatch adapter code to fetch data points from CloudWatch that are at least 2 minutes old?

@devoprock

@vijayansarathy, the client.go code uses the AWS SDK package. Is there any way to pass a parameter to the AWS SDK so it returns data points older than a given number of minutes from CloudWatch? As far as I know, the only option is to pass an end time. Is there a better way to pass a dynamic system time minus 2 minutes to the AWS SDK module?

@vijayansarathy

That is a suggestion. I am curious to see what the computed metric values look like if you drop the most recent minute and go further backwards from there.

Separately, I wanted to ask about the rationale behind using the metric NumberOfMessagesSent for autoscaling. Given that it is a counter, you have to convert it into a rate/utilization metric, which is what has led you down this path.

But how about the metric ApproximateNumberOfMessagesVisible? I have seen customers use this more often for scaling out message consumers/subscribers in a fan-out scenario. If you use this, you will not have to calculate a rate metric like you do now; you only have to use the number of Pods to calculate the average number of messages visible per pod. Then you can simply set a target value of, say, 500 and the HPA will scale the number of Pods out until (ApproximateNumberOfMessagesVisible / NPods) is less than 500. Would that not work?
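
For illustration, a minimal sketch of that approach as an ExternalMetric spec, reusing the queue and pod-count search from the spec shared earlier (the name, label and stat choices here are just one reasonable option, not a tested configuration):

spec:
  name: approx-visible-messages-per-pod          # illustrative name
  queries:
    - id: m2
      metricStat:
        metric:
          namespace: "AWS/SQS"
          metricName: "ApproximateNumberOfMessagesVisible"
          dimensions:
            - name: QueueName
              value: "internal-device-sqs"
        period: 60
        stat: Average
        unit: Count
      returnData: false
    - id: m1
      expression: SEARCH('{"AWS/SQS",cluster} MetricName="xxxx-running-pods"', 'Average', 60)
      label: "xxxx-kube-pods-running"
      returnData: false
    - id: e1
      expression: m2 / m1                        # average visible messages per pod
      label: "approx-visible-messages-per-pod"
      returnData: true
  region: eu-west-3
  resource:
    resource: deployment

The HPA would then target this metric with a targetValue of, say, 500.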

@vijayansarathy

The GetMetricData API in AWS SDK for CloudWatch does support specifying StartTime and EndTime.
Unfortunately, the adapter is not exposing this but is hard-coding it to the trailing 5 minutes.
It is a very simple code change to make it fetch the window from, say, (T-7) to (T-2), where T is the current time.
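
To make that concrete, here is a minimal sketch of the kind of change being suggested, written against aws-sdk-go v1 purely for illustration (the adapter's client code may use a different SDK version, but GetMetricDataInput carries the same StartTime/EndTime fields either way):

package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("eu-west-3")))
	svc := cloudwatch.New(sess)

	// One of the queries from the ExternalMetric spec, for illustration (m1/e1 omitted).
	queries := []*cloudwatch.MetricDataQuery{{
		Id: aws.String("m2"),
		MetricStat: &cloudwatch.MetricStat{
			Metric: &cloudwatch.Metric{
				Namespace:  aws.String("AWS/SQS"),
				MetricName: aws.String("NumberOfMessagesSent"),
				Dimensions: []*cloudwatch.Dimension{
					{Name: aws.String("QueueName"), Value: aws.String("internal-device-sqs")},
				},
			},
			Period: aws.Int64(60),
			Stat:   aws.String("Sum"),
		},
		ReturnData: aws.Bool(true),
	}}

	// Truncate to the top of the minute, then shift the whole window back 2 minutes
	// so the most recent, possibly incomplete, data point is excluded: (T-7) to (T-2).
	now := time.Now()
	endTime := time.Date(now.Year(), now.Month(), now.Day(), now.Hour(), now.Minute(), 0, 0, now.Location()).Add(-2 * time.Minute)
	startTime := endTime.Add(-5 * time.Minute)

	out, err := svc.GetMetricData(&cloudwatch.GetMetricDataInput{
		MetricDataQueries: queries,
		StartTime:         aws.Time(startTime),
		EndTime:           aws.Time(endTime),
	})
	if err != nil {
		fmt.Println("GetMetricData failed:", err)
		return
	}
	for _, r := range out.MetricDataResults {
		fmt.Println(aws.StringValue(r.Label), aws.TimeValueSlice(r.Timestamps), aws.Float64ValueSlice(r.Values))
	}
}

The only functional difference from the current adapter code is the extra Add(-2 * time.Minute) on the end time, which drops the most recent, possibly still-incomplete, data point.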

@devoprock

@vijayansarathy, when we first set up HPA autoscaling we did try ApproximateNumberOfMessagesVisible and noticed the pods flapped a lot over short periods under our heavy load (we could try the HPA behavior options to control that). Our main goal here is to understand how we can use the CloudWatch adapter for other types of metrics on EKS clusters.

@devoprock

@vijayansarathy, as per my understanding we need to change the code below, for example:


now := time.Now()
endTime := time.Date(now.Year(), now.Month(), now.Day(), now.Hour(), now.Minute(), 0, 0, now.Location())
// CloudWatch metrics have latency, we will grab in a 5 minute window and extract the latest value
startTime := endTime.Add(-5 * time.Minute)

Change the computation of "now" in the adapter code above along these lines (the snippet below is Python, just to illustrate subtracting 120 seconds from the current time):

import time
print(time.ctime())           # Tue Mar 30 19:48:41 2021
pasttime = time.time() - 120
print(time.ctime(pasttime))   # Tue Mar 30 19:46:49 2021

@vijayansarathy

now := time.Now()
endTime := time.Date(now.Year(), now.Month(), now.Day(), now.Hour(), now.Minute(), 0, 0, now.Location())
endTime = endTime.Add(-2 * time.Minute)
startTime := endTime.Add(-5 * time.Minute)
