metricbeat runs happily for a while and then starts getting 401 errors when fetching metrics #42307
It appears that the token is expiring; here are the logs from the kubelet:
@pkoutsovasilis @swiatekm would one of you be able to take a quick look at this?
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
Hey @danfinn 👋 could you please share the link to the Helm chart you used to install metricbeat, and the respective (redacted) values? 🙂
Sure, though this might be slightly complicated because we are using an older Helm chart that we maintain locally, with a current docker image. I could send you the template for the deployment if you'd like. I can tell you that we install this in exactly the same way across all of our clusters, and only 2 out of many are doing this. At this point I'm somewhat suspecting an Azure AKS issue, and I have a ticket open with them as well. The metricbeat pods are mounting the token like so:
and I think this should cause the token to get rotated every hour? I have confirmed that this is what happens in the environments where we aren't seeing this issue; I'm now checking whether it is or isn't happening in our broken environments.
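For context, a bound service account token projected into the pod with an hourly expiry typically looks like the pod-spec fragment below. This is a sketch only: the volume name, image tag, and one-hour expiry are illustrative assumptions, not the exact manifest from this report.

```yaml
# Sketch of a projected (bound) service account token mount with hourly rotation.
spec:
  serviceAccountName: metricbeat
  containers:
    - name: metricbeat
      image: docker.elastic.co/beats/metricbeat:8.15.3
      volumeMounts:
        - name: kube-api-access
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          readOnly: true
  volumes:
    - name: kube-api-access
      projected:
        sources:
          - serviceAccountToken:
              path: token
              # the kubelet rewrites this file as the token approaches its lifetime
              expirationSeconds: 3600
          - configMap:
              name: kube-root-ca.crt
              items:
                - key: ca.crt
                  path: ca.crt
          - downwardAPI:
              items:
                - path: namespace
                  fieldRef:
                    fieldPath: metadata.namespace
```

With a mount like this the token file on disk is refreshed automatically, but the client still has to re-read the file to pick up the new token, which is what the later comments in this thread turn out to hinge on.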
This sounds like it could potentially be related?
I should also note that we recently upgraded from k8s 1.28.x to 1.30.4, but we did that across all of our clusters, and as far as I know we are only seeing this issue in 2 of our dev clusters; the rest are still working fine. Nothing related to metricbeat has changed in quite some time.
So yes, this issue does seem related, but the fix is only available starting from 8.15.1 and later versions, so could you retest with such a version of metricbeat? That said, I have to say that there is no official support for a Helm chart for beats, nor for running beats on Windows nodes on Kubernetes. However, the config snippet you posted seems correct to me 🙂 cc @bturquet in case I missed something about the metricbeat kubernetes module
We are seeing this error on both our Windows and our Linux pods. The Helm chart we are using is for metricbeat; I don't believe we customized it to support Windows, but I suppose it's possible, since it's been in place since before I started here. It seems the token IS being rotated on the pods that are having issues, which is not what I expected to see: at least /var/run/secrets/kubernetes.io/serviceaccount/token is getting updated. Perhaps something is causing metricbeat to not re-read the token, but that seems odd since we are running the same version everywhere. I can test with a later version of metricbeat. We are currently running elasticsearch 8.15.3 in dev, so I can try that same version of metricbeat, since it is our dev clusters that are seeing this issue.
Hello, I think the customer is affected by an old issue where we have seen that token expiration leads to 401 errors. The fix for the kubelet (checking your config, it seems you only use kubelet metrics, right?) is in 8.15.1: PR 4036, "Fix first HTTP 401 error when fetching metrics from the Kubelet API caused by a token update". For your information, additional metricsets that use token authentication (kubernetes apiserver, kube-scheduler, and the kubernetes controller) were fixed later in 41910; that fix will be available in the next 8.16 or 8.17 release, and of course in 8.18. Hope the above helps.
I deployed this to one of the dev environments that was seeing the issue, and so far things are looking good. It's been an hour and a half and we haven't seen any of the 401 errors, but I'm going to keep monitoring before rolling this out everywhere. Any theories on why we only saw this on a couple of clusters when we run the same version of metricbeat (and AKS) across maybe 20 or so clusters?
How about different AKS versions? Since v1.30.3, --service-account-extend-token-expiration=false is set (https://github.com/Azure/AKS/releases/tag/2024-06-09).
No, as I mentioned above, they are all on 1.30.4 at this point.
Ah, I think it might actually be related to that. That doc says it is "set to false on OIDC issuer enabled clusters", and we have recently enabled workload identity on some clusters but not all of them yet, which I think may explain it.
Version 8.14.0 on k8s, installed via Helm as a DaemonSet on Azure AKS
We noticed recently that some of our clusters are missing metrics. It appears that metricbeat is getting a 401 from the k8s API (I think that is where the query goes, anyway), but the strange thing is that this doesn't happen right away. If I restart the pods, they will run for some amount of time, for example an hour, and then they start showing the following error:
This has been in place and working for quite some time. We have both Linux and Windows nodes that metricbeat runs on, and both are seeing this error.
metricbeat config:
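(The original config block was not captured in this thread. For reference, a kubelet-scoped kubernetes module configuration of the kind being discussed typically looks roughly like the following; the hosts, metricsets, period, and token path here are the usual defaults, assumed for illustration rather than taken from the reporter's actual settings.)

```yaml
# Illustrative kubelet-scoped kubernetes module config (assumed defaults, not the reporter's file).
metricbeat.modules:
  - module: kubernetes
    metricsets:
      - node
      - pod
      - container
      - volume
      - system
    period: 10s
    hosts: ["https://${NODE_NAME}:10250"]
    # the mounted, auto-rotated service account token discussed above
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    ssl.verification_mode: "none"
    # or, instead of disabling verification:
    # ssl.certificate_authorities:
    #   - /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
```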