metricbeat runs happily for a while and then starts getting 401 errors when fetching metrics #42307

Open
danfinn opened this issue Jan 14, 2025 · 14 comments
Labels: Team:Elastic-Agent-Control-Plane

Comments

danfinn commented Jan 14, 2025

version 8.14.0 on k8s installed via helm as a daemonset on Azure AKS

We noticed recently that some of our clusters are missing metrics. It appears that metricbeat is getting a 401 from the k8s API (I think that is where the query goes anyway), but the strange thing is that this doesn't happen right away. If I restart the pods, they will run for some amount of time, for example an hour, and then they start showing the following error:

{"log.level":"error","@timestamp":"2025-01-14T19:51:34.027Z","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).fetch","file.name":"module/wrapper.go","file.line":256},"message":"Error fetching data for metricset kubernetes.volume: error doing HTTP request to fetch 'volume' Metricset data: HTTP error 401 in : 401 Unauthorized","service.name":"metricbeat","ecs.version":"1.6.0"}

This has been in place and working for quite some time. We have both Linux and Windows nodes that metricbeat runs on, and both are seeing this error.

metricbeat config:

  metricbeat.yml: |
    setup:
      template:
        enabled: false
        #name: "${INDEX_TEMPLATE_NAME}"
        #pattern: "*-*-*-metrics-*"
        #fields: "fields.yml"
        #settings:
          #index.number_of_shards: 1
          #index.number_of_replicas: 1
          #index.lifecycle.name: "bmap_obsv_ilm_policy"
      ilm:
        enabled: true
      kibana:
        host: "${KIBANA_HOST:kibana.example.com:5601}"
        protocol: "${KIBANA_PROTOCOL:https}"
        ssl.verification_mode: "${KIBANA_VERIFYSSL:none}"
        username: "${KIBANA_USERNAME:kibana}"
        password: "${KIBANA_PASSWORD:welcome1}"

    logging:
      level: "${LOG_LEVEL:warning}"
      to_stderr: true
      json: true

    metricbeat.autodiscover:
      providers:
        - type: kubernetes
          hints.enabled: true

    output.elasticsearch:
      hosts: "[${ELASTICSEARCH_HOST}:9200]"
      protocol: "${ELASTICSEARCH_PROTOCOL:https}"
      username: "${ELASTICSEARCH_USERNAME:elastic}"
      ssl.verification_mode: "${ELASTICSEARCH_VERIFYSSL:none}"
      password: "${ELASTICSEARCH_PASSWORD:welcome1}"
      index: "${INDEX_NAME}-%{[agent.version]}-%{+yyyy.MM.dd}"
      allow_older_versions: true

    processors:
       - drop_event.when:
           or:
           - equals:
               kubernetes.namespace: "azure-sql-exporter"
           - equals:
               kubernetes.namespace: "azure-sql-operator-system"
           - equals:
               kubernetes.namespace: "akv2k8s"
           - equals:
               kubernetes.namespace: "monitoring"
           - equals:
               kubernetes.namespace: "deploys"
           - equals:
               kubernetes.namespace: "tigera-operator"
           - equals:
               kubernetes.namespace: "calico-system"
           - equals:
               kubernetes.namespace: "kube-system"
           - equals:
               kubernetes.namespace: "kube-public"
           - equals:
               kubernetes.namespace: "kube-node-lease"
           - equals:
               kubernetes.namespace: "gatekeeper-system"
           - equals:
               kubernetes.namespace: "ots"
           - equals:
               kubernetes.namespace: "grafana"

    metricbeat.modules:
    - module: kubernetes
      metricsets:
        - container
        - node
        - pod
        - system
        - volume
      period: 1m
      host: "${NODE_NAME}"
      hosts: ["https://${NODE_NAME}:10250"]
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      ssl.verification_mode: "none"
      processors:
      - add_kubernetes_metadata: ~

    - module: kubernetes
      enabled: true
      metricsets:
        - event

    - module: system
      period: 1m
      metricsets:
        - filesystem
        - fsstat
      processors:
      - drop_event.when.regexp:
          system.filesystem.mount_point: '^/(sys|cgroup|proc|dev|etc|host|lib)($|/)'

    - module: system
      period: 1m
      metricsets:
        - core
        - cpu
        - diskio
        - load
        - memory
        - network
        - process
        - process_summary
      processes: ['.*']
      process.include_top_n:
        by_cpu: 5
        by_memory: 5
botelastic bot added the needs_team label on Jan 14, 2025

danfinn commented Jan 15, 2025

It appears that the token is expiring; here are the logs from kubelet:

Jan 15 16:25:29 aks-nodepool1-21451151-vmss0000JX kubelet[2923]: E0115 16:25:29.612546    2923 server.go:304] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has expired]"
Jan 15 16:26:32 aks-nodepool1-21451151-vmss0000JX kubelet[2923]: E0115 16:26:32.058221    2923 server.go:304] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has expired]"
Jan 15 16:27:32 aks-nodepool1-21451151-vmss0000JX kubelet[2923]: E0115 16:27:32.071532    2923 server.go:304] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has expired]"
Jan 15 16:28:33 aks-nodepool1-21451151-vmss0000JX kubelet[2923]: E0115 16:28:33.706188    2923 server.go:304] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has expired]"
Jan 15 16:29:33 aks-nodepool1-21451151-vmss0000JX kubelet[2923]: E0115 16:29:33.704566    2923 server.go:304] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has expired]"


jlind23 commented Jan 15, 2025

@pkoutsovasilis @swiatekm would one of you be able to take a quick look at this?

jlind23 added the Team:Elastic-Agent-Control-Plane label and removed the needs_team label on Jan 15, 2025
@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pkoutsovasilis

Hey @danfinn 👋 could you please share with me the link to the Helm chart you used to install metricbeat and the respective (redacted) values? 🙂


danfinn commented Jan 15, 2025

Sure, though this might be slightly complicated, because we are using an older Helm chart that we maintain locally, but with a Docker image that is current. I could send you the template for the deployment if you'd like. I can tell you that we install this in exactly the same way across all of our clusters, and only 2 out of many are doing this. At this point I'm somewhat suspecting this might be an Azure AKS issue, and I do have a ticket open with them as well.

The metricbeat pods are mounting the token like so:

  - name: kube-api-access-bch74
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt

and I think this should cause the token to get rotated every hour? I have confirmed that this is what happens in the environments where we aren't seeing this issue. I'm working now to see whether that is also happening in our broken environments.
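
A quick way to confirm whether the projected token is actually being rotated is to decode the JWT payload from inside the pod and look at its `iat`/`exp` claims. This is a minimal sketch, assuming a Python interpreter is available in the container; the token path is the one from the config above:

```python
import base64
import json
import time

# Path of the projected service account token mounted into the metricbeat pod
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"

with open(TOKEN_PATH) as f:
    token = f.read().strip()

# A JWT is header.payload.signature; decode the payload (base64url, unpadded)
payload_b64 = token.split(".")[1]
payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
claims = json.loads(base64.urlsafe_b64decode(payload_b64))

issued_at = claims["iat"]
expires_at = claims["exp"]
print("issued at:   ", time.ctime(issued_at))
print("expires at:  ", time.ctime(expires_at))
print("lifetime (s):", expires_at - issued_at)
print("seconds left:", expires_at - int(time.time()))
```

If rotation is working, re-running this after the expiry time should show a fresh `iat`/`exp` with a lifetime roughly matching the requested 3607 seconds.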


danfinn commented Jan 15, 2025

This sounds like it could potentially be related:

#40627


danfinn commented Jan 15, 2025

I should also note that we did recently upgrade from k8s 1.28.x to 1.30.4, but we did that across all of our clusters and, as far as I know, we are only seeing this issue in 2 of our dev clusters; the rest are still working fine. Nothing related to metricbeat has changed in quite some time.


pkoutsovasilis commented Jan 15, 2025

So yes, that issue does seem related, but the fix for it is only available starting from 8.15.1 and later versions. Could you therefore retest with such a version of metricbeat?

That said, I have to say that there is no officially supported Helm chart for Beats, and no official support for running Beats on Windows on Kubernetes. However, the config snippet you posted looks correct to me 🙂

cc @bturquet in case I'm missing something about the metricbeat kubernetes module


danfinn commented Jan 15, 2025

We are seeing this error on both our Windows and our Linux pods. The Helm chart we are using is for metricbeat. I don't believe we customized it to support Windows, but I suppose it's possible; it's been in place since before I started here.

So it seems the token IS being rotated on the pods that are having issues, which is not what I expected to see. At least /var/run/secrets/kubernetes.io/serviceaccount/token is getting updated. Perhaps something is causing metricbeat to not re-read the token, but that seems odd since we are running the same version everywhere.
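
For context on why a rotated token can still lead to 401s, here is a minimal illustration of the suspected failure mode (this is not metricbeat's actual code): a client that reads the bearer token once at startup keeps presenting the stale token after rotation, whereas re-reading the mounted file before each request always picks up the fresh one. The kubelet URL and token path mirror the config above; the node name is a placeholder.

```python
import ssl
import urllib.request

TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
KUBELET_URL = "https://NODE_NAME:10250/stats/summary"  # placeholder node name

# Mirror ssl.verification_mode: "none" from the metricbeat config above.
insecure_ctx = ssl.create_default_context()
insecure_ctx.check_hostname = False
insecure_ctx.verify_mode = ssl.CERT_NONE

def fetch(token: str):
    req = urllib.request.Request(
        KUBELET_URL, headers={"Authorization": f"Bearer {token}"}
    )
    return urllib.request.urlopen(req, context=insecure_ctx)

# Failure mode: the token is read once at startup and cached. Once the
# projected token expires (~1 hour here), every fetch returns HTTP 401,
# even though the file on disk has already been rotated.
cached_token = open(TOKEN_PATH).read().strip()
def fetch_with_cached_token():
    return fetch(cached_token)

# Behavior after a fix: re-read the mounted file before each fetch, so the
# freshly rotated token is always presented to the kubelet.
def fetch_with_fresh_token():
    return fetch(open(TOKEN_PATH).read().strip())
```

This is only an illustration of the symptom; the actual fix in beats for the Kubelet metricsets is referenced in the comments below.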

I can test with a later version of metricbeat. We are currently running Elasticsearch 8.15.3 in dev; I can try that same version of metricbeat, since it is on our dev clusters that we are seeing this issue.


gizas commented Jan 16, 2025

Hello, I think you are affected by an old issue where we have seen token expiration lead to 401 errors. The fix for kubelet (checking your config, it seems you only use kubelet metrics, right?) is in 8.15.1: PR 4036, "Fix first HTTP 401 error when fetching metrics from the Kubelet API caused by a token update".

For your information, there were additional metricsets (kubernetes apiserver, kube-scheduler and the kubernetes-controller) that use token authentication and were fixed later in 41910. That fix will be available in the next 8.16 or 8.17 release, or of course in 8.18. Hope the above helps.


danfinn commented Jan 16, 2025

I deployed this to one of the dev environments that was seeing this issue and so far things are looking good. It's been an hour and a half and we haven't seen any of the 401 errors but I'm going to keep monitoring before rolling this out everywhere.

Any theories on why we only saw this on a couple of clusters, when we run the same version of metricbeat (and AKS) across maybe 20 or so clusters?


gizas commented Jan 16, 2025

> Any theories on why we only saw this on a couple clusters

How about different AKS versions? Since v1.30.3, --service-account-extend-token-expiration=false is set (https://github.com/Azure/AKS/releases/tag/2024-06-09).


danfinn commented Jan 16, 2025

No, as I mentioned above they are all on 1.30.4 at this point.


danfinn commented Jan 16, 2025

Ah, I think it might actually be related to that. That doc says "set to false on OIDC issuer enabled clusters", and we have recently enabled workload identity on some clusters but not all of them yet, which I think may explain it.
