MON-3920: add runbook for PrometheusKubernetesListWatchFailures #225

Open: wants to merge 1 commit into base: master
@@ -0,0 +1,58 @@
# PrometheusKubernetesListWatchFailures

## Meaning

The `PrometheusKubernetesListWatchFailures` alert is triggered when there is a constant
increase in failures with `LIST/WATCH` requests to the Kubernetes API during Prometheus target
discovery.

## Impact

This may prevent Prometheus from adding, updating, or deleting targets effectively.

## Diagnosis

Determine whether the alert has triggered for the instance of Prometheus used
for default cluster monitoring or for the instance that monitors user-defined
projects by viewing the alert message's `namespace` label: the namespace for
default cluster monitoring is `openshift-monitoring` and the namespace for
user workload monitoring is `openshift-user-workload-monitoring`.

## Mitigation

To gain further insight, review the logs of the affected Prometheus instance:

Contributor: Aren't we still in the diagnosis phase?


```shell
$ NAMESPACE='<value of namespace label from alert>'

$ oc -n $NAMESPACE logs -c prometheus -l 'app.kubernetes.io/name=prometheus'
```

Reviewer: FYI, this command only outputs 20 lines of logs, so the error may not be in the output. Maybe
`$ oc -n $NAMESPACE logs -c prometheus ${prometheus_pod}`
is better?

Contributor Author: Using the label is meant to get the logs from both pods without having to specify their names.

You aren't seeing any logs even though PrometheusKubernetesListWatchFailures is firing? There should be no logs when everything is fine.

@juzhao (Nov 27, 2024): I meant that the command only outputs 20 lines of logs, and the error may not be among them.
The default log level for Prometheus is info, so there are many log lines before the error appears. Example: https://privatebin.corp.redhat.com/?545d07abffd73da8#HhcRUpPiLk5apApkEDsKkCCYcCjixMbUbx3tUpGHxUup
`oc -n $NAMESPACE logs -c prometheus -l 'app.kubernetes.io/name=prometheus' --tail=-1` will show all logs.
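
As a rough sketch of the `--tail=-1` suggestion from the review thread, the full log stream can be narrowed to the relevant failures. The function name `filter_listwatch_failures` is hypothetical, and the match pattern is an assumption modeled on the sample log line shown in the RBAC section of this runbook; it is not part of the PR.

```shell
# Hypothetical helper (not part of the PR): keep only list/watch failures
# reported by the Kubernetes client inside Prometheus.
# The pattern is an assumption based on the sample log line in this runbook.
filter_listwatch_failures() {
  grep -iE 'component=k8s_client_runtime.*failed to (list|watch)'
}

# Against a live cluster, with NAMESPACE set from the alert's namespace label:
# oc -n "$NAMESPACE" logs -c prometheus -l 'app.kubernetes.io/name=prometheus' --tail=-1 \
#   | filter_listwatch_failures
```

Filtering after fetching all lines keeps the label selector, so both pods are still covered without naming them.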

The issue may arise from one or both of the following scenarios:

### Insufficient RBAC Permissions for Prometheus

In this scenario, Prometheus has been tasked to discover targets in a specified namespace, likely
through `ServiceMonitor` or `PodMonitor` resources, but it lacks the RBAC permissions needed to
query the `Service`, `Endpoints`, `Pod`, and other related resources where the targets are
defined. Log messages such as the following may be observed:

```
ts=2024-11-13T07:09:51.190Z caller=klog.go:108 level=warn component=k8s_client_runtime
func=Warningf msg="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:554:
failed to list *v1.Endpoints: endpoints is forbidden:
User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot list resource
\"endpoints\" in API group \"\" in the namespace \"foo\""
```
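
Before changing RBAC, it can help to confirm the missing permission independently of the logs. The sketch below turns a `forbidden` log line into an `oc auth can-i` check that impersonates the service account named in the message. The helper name `forbidden_to_can_i` is hypothetical, the parsing assumes the message layout of the sample above arriving on a single line, and running the printed command assumes your user is allowed to impersonate service accounts.

```shell
# Hypothetical helper (not part of the PR): build an `oc auth can-i` command
# from a "forbidden" log line such as the sample above.
forbidden_to_can_i() {
  # Drop the backslashes that escape the quotes inside msg="...".
  line=$(printf '%s' "$1" | tr -d '\\')
  # Extract the service account, verb, resource, and namespace from the message.
  user=$(printf '%s' "$line" | sed -n 's/.*User "\([^"]*\)" cannot.*/\1/p')
  verb=$(printf '%s' "$line" | sed -n 's/.*cannot \([a-z]*\) resource.*/\1/p')
  resource=$(printf '%s' "$line" | sed -n 's/.*resource "\([^"]*\)".*/\1/p')
  namespace=$(printf '%s' "$line" | sed -n 's/.*in the namespace "\([^"]*\)".*/\1/p')
  # Print the command instead of running it, so it can be reviewed first.
  printf 'oc auth can-i %s %s -n %s --as=%s\n' "$verb" "$resource" "$namespace" "$user"
}
```

For the sample message, the helper prints `oc auth can-i list endpoints -n foo --as=system:serviceaccount:openshift-monitoring:prometheus-k8s`. An answer of `no` from that command is consistent with the RBAC cause described here.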

To rectify this issue, ensure Prometheus is granted the necessary RBAC permissions as detailed in
the following guide: [Configuring Prometheus to scrape metrics].

Contributor: Any reason to not just have the link here?

Contributor Author: With the link inline, the line would exceed markdownlint's line-length limit.

Contributor Author: (otherwise, we cannot split the URL)

Contributor: Yeah, I get it. Not the biggest fan of working around it this way, but what can you do. I'll see if there is a lint exception. LGTM in the meantime.

Contributor Author: It's legitimate Markdown, though, and it's already used in other places (https://github.com/search?q=repo%3Aopenshift%2Frunbooks+%22%5D%3A+%22&type=code). For me, it's better than using an exception...

Contributor: Oh, that actually gets rendered. TIL, sorry for the noise.

Contributor: IIUC the permission issue can only happen for platform Prometheus, and it's either a misconfiguration from the certified operator (product bug) or a user-defined service/pod monitor deployed in a platform namespace (unsupported config). User-defined Prometheus should have full permissions by default.

The other cause could be a partial or complete outage of the Kubernetes API.


---

If you cannot resolve the issue, log in to the
[Customer Portal](https://access.redhat.com) and open a support case,
attaching the artifacts gathered during the diagnosis procedure.

[Configuring Prometheus to scrape metrics]: https://rhobs-handbook.netlify.app/products/openshiftmonitoring/collecting_metrics.md/#configuring-prometheus-to-scrape-metrics