This repository has been archived by the owner on Sep 24, 2021. It is now read-only.

k8s-cloudwatch-adapter ExternalMetric reports incorrect values on EKS. #76

Open
snandhyala opened this issue Feb 22, 2021 · 14 comments

@snandhyala

@chankh

We are doing application autoscaling on a QA EKS cluster using SQS/custom metrics, using the external AWS CloudWatch adapter plugin on EKS to fetch metrics from CloudWatch. The issue is that, intermittently, the HPA's current value is lower than and inconsistent with the CloudWatch data points, so application autoscaling does not happen as expected. We are fetching CloudWatch metrics from a different account using an IAM role.

Below is the ExternalMetric spec:

spec:
  name: partnerapi-notifications-z1-goldenqa-msg-requests-avg-new
  queries:
    - id: m2
      metricStat:
        metric:
          namespace: "AWS/SQS"
          metricName: "NumberOfMessagesSent"
          dimensions:
            - name: QueueName
              value: "internal-device-sqs"
        period: 60
        stat: Sum
        unit: Count
      returnData: false
    - id: m1
      expression: SEARCH('{"AWS/SQS",cluster} MetricName="xxxx-running-pods"', 'Average', 60)
      label: "xxxx-kube-pods-running"
      returnData: false
    - id: e1
      expression: m2 / m1
      label: "avg-number-of-messages-each-pod"
      returnData: true
  region: eu-west-3
  resource:
    resource: deployment
  roleArn: arn:aws:iam::xxxxxx:role/target-xxxxx
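
For context, an HPA consuming an ExternalMetric like this would typically look something like the following sketch (the HPA name, deployment name, replica bounds and target value are illustrative, not the exact values from the cluster); the HPA's "current" value is whatever the adapter reports for the e1 query:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: partnerapi-notifications-hpa      # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend-partner-notifications   # illustrative deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metricName: partnerapi-notifications-z1-goldenqa-msg-requests-avg-new
        targetValue: 500                  # illustrative threshold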


Adapter pod logs:

0221 10:38:32.324000 1 handler.go:67] externalMetricInfo: &{{ExternalMetric metrics.aws/v1alpha1} {partnerapi-notifications-z1-goldenqa-msg-requests-avg-new partner-notifications /apis/metrics.aws/v1alpha1/namespaces/partner-notifications/externalmetrics/partnerapi-notifications-z1-goldenqa-msg-requests-avg-new dfa4fff7-b053-4037-9452-0d907ce4c702 65889829 1 2021-02-21 10:22:21 +0000 UTC map[app.kubernetes.io/instance:golden-qa app.kubernetes.io/managed-by:Helm app.kubernetes.io/name:backend-partner-notifications environment:GOLDENQA helm.sh/chart:backend-partner-notifications-1.0.0] map[kubectl.kubernetes.io/last-applied-configuration:{"apiVersion":"metrics.aws/v1alpha1","kind":"ExternalMetric","metadata":{"annotations":{},"labels":{"app.kubernetes.io/instance":"golden-qa","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"backend-partner-notifications","environment":"GOLDENQA","helm.sh/chart":"backend-partner-notifications-1.0.0"},"name":"partnerapi-notifications-z1-goldenqa-msg-requests-avg-new","namespace":"partner-notifications"},"spec":{"name":"partnerapi-notifications-z1-goldenqa-msg-requests-avg-new","queries":[{"id":"m2","metricStat":{"metric":{"dimensions":[{"name":"QueueName","value":"internal-device-sqs"}],"metricName":"NumberOfMessagesSent","namespace":"AWS/SQS"},"period":60,"stat":"Sum","unit":"Count"},"returnData":false},{"expression":"SEARCH('{"AWS/SQS",cluster} MetricName="partner-notifications-running-pods"','Average', 60)","id":"m1","label":"partnerapi-notifications-kube-pods-running","returnData":false},{"expression":"m2 / m1","id":"e1","label":"avg-number-of-messages-each-pod","returnData":true}],"region":"eu-west-3","resource":{"resource":"deployment"},"roleArn":"arn:aws:iam::xxxxxx:role/target-cloudwatch-role"}}
] [] [] []} {partnerapi-notifications-z1-goldenqa-msg-requests-avg-new 0xc0004e4910 0xc0004e4900 [{ m2 {{[{QueueName internal-device-sqs}] NumberOfMessagesSent AWS/SQS} 60 Sum Count} 0xc00014a28f} {SEARCH('{"AWS/SQS",cluster} MetricName="partner-notifications-running-pods"','Average', 60) m1 partnerapi-notifications-kube-pods-running {{[] } 0 } 0xc00014a2ca} {m2 / m1 e1 avg-number-of-messages-each-pod {{[] } 0 } 0xc00014a30a}]}}

@snandhyala
Author

@chankh ,

I used the same spec above in a direct CloudWatch API call and the data points are correct there. On the EKS HPA, however, we always see lower, incorrect values compared to the CloudWatch metrics. Could you please review and help with this issue?

@snandhyala
Author

snandhyala commented Feb 22, 2021

@chankh ,

Enabled Debug=true on the pod and noticed that k8s-cloudwatch-adapter picks up a wrong value compared to the CloudWatch data point. Detailed logs and a CloudWatch screenshot are below.

Example: 2021-02-22T23:18:00Z = 1027.76 --> at the same time, CloudWatch shows a value of 1302.

"HTTP/1.1 200 OK
Content-Length: 1002
Content-Type: text/xml
Date: Mon, 22 Feb 2021 23:19:18 GMT
X-Amzn-Requestid: 6398f40e-15fc-48de-b86b-4d62ea6e99a4

2021/02/22 23:19:19




2021-02-22T23:18:00Z
2021-02-22T23:17:00Z
2021-02-22T23:16:00Z
2021-02-22T23:15:00Z
2021-02-22T23:14:00Z


1027.76
1291.04
1291.32
1292.56
1291.64

avg-number-of-messages-each-pod partner-notifications-running-pods
e1
Complete





6398f40e-15fc-48de-b86b-4d62ea6e99a4

"

@snandhyala
Author

snandhyala commented Feb 22, 2021

adapater-debuug.log

@chankh
Contributor

chankh commented Feb 24, 2021

Hi, thanks for reporting the issue. However, I am no longer with AWS and have lost access to this repo. Please feel free to fork this project.

@vijayansarathy

@snandhyala

In your earlier message, you provided the following metric values and pointed out that the first data point is incorrect:

1027.76
1291.04
1291.32
1292.56
1291.64

Which CloudWatch metric do these values correspond to? Are these the values of the metric computed using metric math expressions, that is, the metric with the ID e1 from the ExternalMetric spec you shared earlier?

You also mention that when you used the CloudWatch API directly, you got the correct values.
Did you use the aws cloudwatch get-metric-data API with the exact same metric math expression?

@snandhyala
Author

snandhyala commented Mar 30, 2021 via email

@vijayansarathy

vijayansarathy commented Mar 31, 2021

@snandhyala

Thanks for the clarification.

What I was trying to ascertain was whether there was any issue with the metrics aggregation step that is done within the CloudWatch service. That is the impression I got from the meeting today but it seems that that is not the case.

Say, you are running the following command:

aws cloudwatch get-metric-data --cli-input-json file://metric-data-query.json

and the JSON file metric-data-query.json referenced above has the following metric math expression:

[
    {
        "Id": "m1",
        "Expression": "SEARCH('{\"AWS/SQS\",cluster} MetricName=\"xxxx-running-pods\"', 'Average', 60)",
        "Label": "xxxx-kube-pods-running",
        "ReturnData": false,
        "Period": 60
    },
    {
        "Id": "m2",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/SQS",
                "MetricName": "NumberOfMessagesSent",
                "Dimensions": [
                    {
                        "Name": "QueueName",
                        "Value": "internal-device-sqs"
                    }
                ]
            },
            "Period": 60,
            "Stat": "Sum"
        },
        "ReturnData": false
    },
    {
        "Id": "e1",
        "Expression": "m2 / m1",
        "Label": "avg-number-of-messages-each-pod",
        "ReturnData": true,
        "Period": 60
    }
]
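
As an aside, the query window can also be given explicitly on the command line, supplying the array above via --metric-data-queries rather than --cli-input-json; a sketch, with illustrative timestamps:

aws cloudwatch get-metric-data \
    --metric-data-queries file://metric-data-query.json \
    --start-time 2021-02-22T23:12:00Z \
    --end-time 2021-02-22T23:17:00Z \
    --region eu-west-3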

If the above CLI command gives you the right values, then the conclusion will be that there is nothing wrong with the original metric data in CloudWatch per se, nor with the way it does the aggregation using the SUM operator.

If the values output by the CloudWatch Metrics Adapter for Kubernetes are not consistent with the results returned in the CLI response, then the issue has to be with the adapter. The adapter is hard-coded to retrieve the metric data for the trailing 5 minutes: it fetches the metric values at the top of every minute for the past 5 minutes. Are you seeing a behavior where the metric value it reports for the most recent minute is always incorrect? It appears that the SQS metrics published to CloudWatch are of standard resolution, with data having a one-minute granularity. You are using NumberOfMessagesSent, which has a caveat attached to it; read the footnote at the bottom of the table in the documentation.

All that said, my suggestion would be to check whether the adapter behaves predictably if you fetch the metric data up to, say, 2 minutes before the current time. That is, if you are fetching data at 10:00 AM, gather metric values for the data points at 9:54, 9:55, 9:56, 9:57 and 9:58. Just a suggestion. Of course, you will need to modify the adapter code to do that. I have pointed out exactly where you have to make the change, so you may try rebuilding your version locally and checking. Unfortunately, you are not going to get any official support from AWS for issues pertaining to this adapter.

@devoprock

@vijayansarathy, thanks for the details. So to fix this issue we need to change the CloudWatch adapter code to fetch data points from CloudWatch that are at least 2 minutes old?

@devoprock

@vijayansarathy, the client.go code uses the AWS SDK package. Is there any way to pass a parameter to the AWS SDK so it returns data points older than a given number of minutes from CloudWatch? As far as I know, the only option is to pass an end time. Is there a better way to pass a dynamic system time minus 2 minutes to the AWS SDK module?

@vijayansarathy

That is a suggestion. I am curious to see what the computed metric values look like if you drop the most recent minute and go further backwards from there.

Separately, I wanted to ask about the rationale behind using the metric NumberOfMessagesSent for autoscaling. Given that it is a counter, you have to convert it into a rate/utilization metric, which is what has led you down this path.

But how about the metric ApproximateNumberOfMessagesVisible? I have seen customers use this more often for scaling out message consumers/subscribers in a fan-out scenario. If you use this, you will not have to calculate a rate metric like you do now; you only have to use the number of Pods to calculate the average number of messages visible per pod. Then you can simply set a target value of, say, 500 and the HPA will scale the number of Pods out until (ApproximateNumberOfMessagesVisible / NPods) is less than 500. Would that not work?
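
For illustration, a minimal sketch of that approach as an ExternalMetric spec, reusing the queue and pod-count search from the spec shared earlier (the name, label and stat choices here are just one reasonable option, not a tested configuration):

spec:
  name: approx-visible-messages-per-pod          # illustrative name
  queries:
    - id: m2
      metricStat:
        metric:
          namespace: "AWS/SQS"
          metricName: "ApproximateNumberOfMessagesVisible"
          dimensions:
            - name: QueueName
              value: "internal-device-sqs"
        period: 60
        stat: Average
        unit: Count
      returnData: false
    - id: m1
      expression: SEARCH('{"AWS/SQS",cluster} MetricName="xxxx-running-pods"', 'Average', 60)
      label: "xxxx-kube-pods-running"
      returnData: false
    - id: e1
      expression: m2 / m1                        # average visible messages per pod
      label: "approx-visible-messages-per-pod"
      returnData: true
  region: eu-west-3
  resource:
    resource: deployment

The HPA would then target this metric with a targetValue of, say, 500.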

@vijayansarathy

The GetMetricData API in AWS SDK for CloudWatch does support specifying StartTime and EndTime.
Unfortunately, the adapter is not exposing this but is hard-coding it to the trailing 5 minutes.
It is a very simple code change to make it fetch the window from, say, (T-7) to (T-2), where T is the current time.
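
To make that concrete, here is a minimal sketch of the kind of change being suggested, written against aws-sdk-go v1 purely for illustration (the adapter's client code may use a different SDK version, but GetMetricDataInput carries the same StartTime/EndTime fields either way):

package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("eu-west-3")))
	svc := cloudwatch.New(sess)

	// One of the queries from the ExternalMetric spec, for illustration (m1/e1 omitted).
	queries := []*cloudwatch.MetricDataQuery{{
		Id: aws.String("m2"),
		MetricStat: &cloudwatch.MetricStat{
			Metric: &cloudwatch.Metric{
				Namespace:  aws.String("AWS/SQS"),
				MetricName: aws.String("NumberOfMessagesSent"),
				Dimensions: []*cloudwatch.Dimension{
					{Name: aws.String("QueueName"), Value: aws.String("internal-device-sqs")},
				},
			},
			Period: aws.Int64(60),
			Stat:   aws.String("Sum"),
		},
		ReturnData: aws.Bool(true),
	}}

	// Truncate to the top of the minute, then shift the whole window back 2 minutes
	// so the most recent, possibly incomplete, data point is excluded: (T-7) to (T-2).
	now := time.Now()
	endTime := time.Date(now.Year(), now.Month(), now.Day(), now.Hour(), now.Minute(), 0, 0, now.Location()).Add(-2 * time.Minute)
	startTime := endTime.Add(-5 * time.Minute)

	out, err := svc.GetMetricData(&cloudwatch.GetMetricDataInput{
		MetricDataQueries: queries,
		StartTime:         aws.Time(startTime),
		EndTime:           aws.Time(endTime),
	})
	if err != nil {
		fmt.Println("GetMetricData failed:", err)
		return
	}
	for _, r := range out.MetricDataResults {
		fmt.Println(aws.StringValue(r.Label), aws.TimeValueSlice(r.Timestamps), aws.Float64ValueSlice(r.Values))
	}
}

The only functional difference from the current adapter code is the extra Add(-2 * time.Minute) on the end time, which drops the most recent, possibly still-incomplete, data point.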

@devoprock

@vijayansarathy, when we first set up HPA autoscaling we did try ApproximateNumberOfMessagesVisible and noticed the pods flapped a lot over short periods under our heavy load (we could try the HPA behavior options to control that). Our main goal here is to understand how we can use the CloudWatch adapter for other types of metrics on EKS clusters.

@devoprock

@vijayansarathy, as per my understanding we need to change the code below, for example:


now := time.Now()
endTime := time.Date(now.Year(), now.Month(), now.Day(), now.Hour(), now.Minute(), 0, 0, now.Location())
// CloudWatch metrics have latency, we will grab in a 5 minute window and extract the latest value
startTime := endTime.Add(-5 * time.Minute)

Change the computation of "now" in the adapter code above along these lines (the snippet below is Python, just to illustrate subtracting 120 seconds from the current time):

import time
print(time.ctime())           # Tue Mar 30 19:48:41 2021
pasttime = time.time() - 120
print(time.ctime(pasttime))   # Tue Mar 30 19:46:49 2021

@vijayansarathy

now := time.Now()
endTime := time.Date(now.Year(), now.Month(), now.Day(), now.Hour(), now.Minute(), 0, 0, now.Location())
endTime = endTime.Add(-2 * time.Minute)
startTime := endTime.Add(-5 * time.Minute)
