MON-3920: add runbook for PrometheusKubernetesListWatchFailures #225
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
# PrometheusKubernetesListWatchFailures | ||
|
||
## Meaning | ||
|
||
The `PrometheusKubernetesListWatchFailures` alert fires when the number of failed | ||
`LIST`/`WATCH` requests to the Kubernetes API during Prometheus target discovery | ||
keeps increasing. | ||
|
||
## Impact | ||
|
||
Prometheus may be unable to add, update, or remove scrape targets, so it can keep | ||
scraping stale targets or miss new ones. | ||
|
||
## Diagnosis | ||
|
||
Determine whether the alert has triggered for the instance of Prometheus used | ||
for default cluster monitoring or for the instance that monitors user-defined | ||
projects by viewing the alert message's `namespace` label: the namespace for | ||
default cluster monitoring is `openshift-monitoring` and the namespace for | ||
user workload monitoring is `openshift-user-workload-monitoring`. | ||
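
The namespace check above can be scripted when triaging several alerts. This is a minimal sketch: the function name `prometheus_instance_for` is hypothetical and not part of the runbook, and only the two namespaces named above are recognized.

```shell
# Hypothetical helper: map the alert's `namespace` label to the affected
# Prometheus instance, using the two namespaces listed in this runbook.
prometheus_instance_for() {
  case "$1" in
    openshift-monitoring)
      echo "default cluster monitoring" ;;
    openshift-user-workload-monitoring)
      echo "user workload monitoring" ;;
    *)
      # Any other value is unexpected for this alert.
      echo "unknown"
      return 1 ;;
  esac
}

prometheus_instance_for 'openshift-monitoring'
```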
|
||
## Mitigation | ||
|
||
To gain further insight, review the logs of the affected Prometheus instance: | ||
|
||
```shell | ||
$ NAMESPACE='<value of namespace label from alert>' | ||
|
||
$ oc -n $NAMESPACE logs -c prometheus -l 'app.kubernetes.io/name=prometheus' | ||
FYI, this command only output 20-line logs, maybe the error is not in the output, maybe
Using the label is meant to get the logs from both pods without having to specify their names. You aren't seeing any logs even though
I meant the command only output 20-line logs, the error maybe not in the 20-line logs. |
||
``` | ||
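
When pods are selected with `-l`, `oc logs` prints only the last 10 lines per pod by default, so the relevant error may be cut off. The sketch below assumes that `--tail=-1` (all lines) and `--prefix` (tag each line with its source pod) are acceptable for your log volume. The helper name `filter_listwatch_failures` and its match pattern are illustrative assumptions based on the sample `failed to list` message, not part of the runbook.

```shell
# Hypothetical helper (not part of the runbook): keep only the lines that
# report failed LIST or WATCH calls. The pattern is an assumption derived
# from the sample "failed to list" message and is matched case-insensitively.
filter_listwatch_failures() {
  grep -iE 'failed to (list|watch) '
}

# Intended use against a live cluster (left commented out because it needs
# `oc` and a logged-in session):
#   oc -n "$NAMESPACE" logs -c prometheus \
#     -l 'app.kubernetes.io/name=prometheus' --tail=-1 --prefix \
#     | filter_listwatch_failures
```

If the full log is too large, a bounded value such as `--tail=1000` or a time window with `--since=1h` may be a better trade-off.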
|
||
The issue may arise from one or both of the following scenarios: | ||
|
||
### Insufficient RBAC Permissions for Prometheus | ||
|
||
In this scenario, Prometheus has been asked to discover targets in a given namespace, typically | ||
through `ServiceMonitor` or `PodMonitor` resources, but it lacks the RBAC permissions needed to | ||
query the `Service`, `Endpoints`, `Pod`, and related resources in the namespace where the targets | ||
are defined. Log messages such as the following may be observed: | ||
|
||
``` | ||
ts=2024-11-13T07:09:51.190Z caller=klog.go:108 level=warn component=k8s_client_runtime | ||
func=Warningf msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:554: | ||
failed to list *v1.Endpoints: endpoints is forbidden: | ||
User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource | ||
\"endpoints\" in API group \"\" in the namespace \"foo\"" | ||
``` | ||
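
To confirm that the permission is really missing, the `forbidden` message can be turned into an `oc auth can-i` check. The sketch below is an illustration, not the runbook's procedure: the helper name `forbidden_to_can_i` is hypothetical, it assumes the message arrives on a single line (the sample above is wrapped for readability), and it ignores the API group, so it only suits core-group resources such as `endpoints`, `services`, and `pods`.

```shell
# Hypothetical helper: read Prometheus log lines on stdin and, for each
# "is forbidden" list/watch failure, print the matching `oc auth can-i`
# command. It only prints the command; it does not run it.
# Quotes in the log are escaped as \", hence the optional backslashes.
forbidden_to_can_i() {
  sed -nE 's/.*User \\?"([^"\\]+)\\?" cannot (list|watch) resource \\?"([^"\\]+)\\?".* in the namespace \\?"([^"\\]+)\\?".*/oc auth can-i \2 \3 -n \4 --as=\1/p'
}

# Example with the message from this runbook, joined onto one line.
sample='failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource \"endpoints\" in API group \"\" in the namespace \"foo\""'
printf '%s\n' "$sample" | forbidden_to_can_i
```

Running the printed command requires permission to impersonate the service account (for example, as a cluster administrator). It answers `yes` or `no`; a `no` confirms the missing permission.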
|
||
To rectify this issue, ensure Prometheus is granted the necessary RBAC permissions as detailed in | ||
the following guide: [Configuring Prometheus to scrape metrics]. | ||
Any reason to not just have the link here?
With this we can exceed markdownlint's line-length
(otherwise, we cannot split the URL)
yeah I get it. Not the biggest fan of working around it this way, but what can you do. I'll see if there is a lint exception. lgtm in the mean time.
It's legitimate markdown though that's already used in other places (https://github.com/search?q=repo%3Aopenshift%2Frunbooks+%22%5D%3A+%22&type=code), for me it's better than using an exception...
Oh that actually gets rendered. TIL, sorry for the noise.
IIUC the permission issue can only happen for platform Prometheus and it's either a misconfiguration from the certified operator (product bug) or a user-defined service/pod monitor deployed in a platform namespace (unsupported config). User-defined Prometheus should have full permissions by default.
The other cause could be a partial/complete outage of the Kubernetes API. |
||
|
||
--- | ||
|
||
If you cannot resolve the issue, log in to the | ||
[Customer Portal](https://access.redhat.com) and open a support case, | ||
attaching the artifacts gathered during the diagnosis procedure. | ||
|
||
[Configuring Prometheus to scrape metrics]: https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/#configuring-prometheus-to-scrape-metrics | ||
|
Aren't we still in the diagnosis phase?