MON-3920: add runbook for PrometheusKubernetesListWatchFailures #225

machine424 · 2024-11-26T10:58:59Z

No description provided.

openshift-ci-robot · 2024-11-26T10:59:04Z

@machine424: This pull request references MON-3920 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2024-11-26T10:59:26Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: machine424

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~alerts/cluster-monitoring-operator/OWNERS~~ [machine424]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

machine424 · 2024-11-26T10:59:40Z

/jira refresh

openshift-ci-robot · 2024-11-26T10:59:44Z

@machine424: This pull request references MON-3920 which is a valid jira issue.

In response to this:

/jira refresh


openshift-ci · 2024-11-26T11:07:59Z

@machine424: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | `b02e4a4` | link | true | `/test markdownlint` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

juzhao · 2024-11-27T06:52:26Z

The ci/prow/markdownlint job failed, mostly because lines exceed the line-length limit:

Summary: 7 error(s)
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:6:81 MD013/line-length Line length [Expected: 80; Actual: 94]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:35:81 MD013/line-length Line length [Expected: 80; Actual: 97]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:36:81 MD013/line-length Line length [Expected: 80; Actual: 91]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:37:98 MD009/no-trailing-spaces Trailing spaces [Expected: 0 or 2; Actual: 1]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:37:81 MD013/line-length Line length [Expected: 80; Actual: 98]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:40 MD040/fenced-code-language Fenced code blocks should have a language specified [Context: "```"]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:48:81 MD013/line-length Line length [Expected: 80; Actual: 97]
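
For a quick local approximation of the MD013 findings above before pushing, here is a rough sketch. The helper name `long_lines` is hypothetical, and the check only mimics the 80-column limit: it does not apply markdownlint's own exemptions and does not cover the other reported rules (MD009, MD040).

```shell
# Hypothetical helper: report lines longer than 80 characters
# in the form file:line: length.
# Only approximates MD013; markdownlint itself remains the source of truth.
long_lines() {
  awk -v max=80 '
    length($0) > max { printf "%s:%d: %d\n", FILENAME, FNR, length($0) }
  ' "$@"
}

# Usage (path taken from the lint report above):
# long_lines alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md
```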

juzhao · 2024-11-27T09:20:25Z

alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md

+   ```shell
+   $ NAMESPACE='<value of namespace label from alert>'
+
+   $ oc -n $NAMESPACE logs -c prometheus -l 'app.kubernetes.io/name=prometheus'


FYI, this command only outputs 20 lines of logs, so the error may not appear in the output. Maybe
$ oc -n $NAMESPACE logs -c prometheus ${prometheus_pod}
is better?

Using the label is meant to get the logs from both pods without having to specify their names.

You aren't seeing any logs even though PrometheusKubernetesListWatchFailures is firing? There should be no logs when everything is fine.

I meant that the command only outputs 20 lines of logs, so the error may not be among them.
The default log level for Prometheus is info, so there are many log lines before we see the error. Example: https://privatebin.corp.redhat.com/?545d07abffd73da8#HhcRUpPiLk5apApkEDsKkCCYcCjixMbUbx3tUpGHxUup
oc -n $NAMESPACE logs -c prometheus -l 'app.kubernetes.io/name=prometheus' --tail=-1 will show all logs.
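
The variant discussed in this thread could look like the sketch below: it keeps the label selector so that both pods are covered and adds `--tail=-1` so that the output is not truncated. The function name `prometheus_logs`, the `--prefix` flag (which tags each line with its source pod), and the `grep` pattern in the usage line are additions for illustration, not part of the proposed runbook.

```shell
# Hypothetical wrapper around the command from the review thread.
# $1: value of the namespace label from the alert.
prometheus_logs() {
  oc -n "$1" logs -c prometheus \
    -l 'app.kubernetes.io/name=prometheus' \
    --tail=-1 --prefix
}

# Usage; the pattern is an assumption about how list/watch errors are worded:
# prometheus_logs "$NAMESPACE" | grep -iE 'failed to (list|watch)'
```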

jan--f · 2024-11-27T09:32:49Z

alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md

+```
+
+To rectify this issue, ensure Prometheus is granted the necessary RBAC permissions as detailed in
+the following guide: [Configuring Prometheus to scrape metrics].


Any reason to not just have the link here?

With this we can exceed markdownlint's line-length

(otherwise, we cannot split the URL)

Yeah, I get it. Not the biggest fan of working around it this way, but what can you do. I'll see if there is a lint exception. lgtm in the meantime.

It's legitimate markdown though that's already used in other places (https://github.com/search?q=repo%3Aopenshift%2Frunbooks+%22%5D%3A+%22&type=code), for me it's better than using an exception...

Oh that actually gets rendered. TIL, sorry for the noise.

machine424 · 2024-11-27T09:37:45Z

The ci/prow/markdownlint job failed, mostly because lines exceed the line-length limit:

Summary: 7 error(s)
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:6:81 MD013/line-length Line length [Expected: 80; Actual: 94]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:35:81 MD013/line-length Line length [Expected: 80; Actual: 97]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:36:81 MD013/line-length Line length [Expected: 80; Actual: 91]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:37:98 MD009/no-trailing-spaces Trailing spaces [Expected: 0 or 2; Actual: 1]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:37:81 MD013/line-length Line length [Expected: 80; Actual: 98]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:40 MD040/fenced-code-language Fenced code blocks should have a language specified [Context: "```"]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:48:81 MD013/line-length Line length [Expected: 80; Actual: 97]

Yep, I intended to fix that as part of the review suggestions, as I'm sure @eromanova97 has some ;)

simonpasquier · 2024-11-27T09:58:40Z

alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md

+
+## Mitigation
+
+To gain further insight, review the logs of the affected Prometheus instance:


Aren't we still in the diagnosis phase?

simonpasquier · 2024-11-27T10:02:42Z

alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md

+```
+
+To rectify this issue, ensure Prometheus is granted the necessary RBAC permissions as detailed in
+the following guide: [Configuring Prometheus to scrape metrics].


IIUC the permission issue can only happen for platform Prometheus, and it's either a misconfiguration from the certified operator (product bug) or a user-defined service/pod monitor deployed in a platform namespace (unsupported config). User-defined Prometheus should have full permissions by default.

The other cause could be a partial/complete outage of the Kubernetes API.
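
To tell the two causes above apart, a hedged sketch: `oc auth can-i` with impersonation checks whether a service account may list and watch a resource, and `oc get --raw /readyz` is a basic probe of the API server. The helper name `check_sd_access`, the default service account `system:serviceaccount:openshift-monitoring:prometheus-k8s`, and the resource list are assumptions for illustration; impersonation itself requires sufficient privileges.

```shell
# Hypothetical helper.
# $1: namespace to check; $2: service account (assumed default below).
check_sd_access() {
  ns="$1"
  sa="${2:-system:serviceaccount:openshift-monitoring:prometheus-k8s}"
  for resource in pods services endpoints; do
    for verb in list watch; do
      # Label each answer; oc prints "yes" or "no" for the pair.
      printf '%s %s: ' "$verb" "$resource"
      oc auth can-i "$verb" "$resource" -n "$ns" --as="$sa" || true
    done
  done
}

# Usage: check_sd_access "$NAMESPACE"
# Separately, a basic check for an API outage:
# oc get --raw /readyz
```

A "no" answer points towards the RBAC cause, while errors or timeouts from either command point towards a problem with the Kubernetes API itself.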

MON-3920: add runbook for PrometheusKubernetesListWatchFailures

b02e4a4

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 26, 2024

openshift-ci bot requested review from marioferh and simonpasquier November 26, 2024 10:59

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 26, 2024
