MON-3920: add runbook for PrometheusKubernetesListWatchFailures #225

Open
wants to merge 1 commit into
base: master
from

Conversation

machine424
Contributor

No description provided.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 26, 2024
@openshift-ci-robot

openshift-ci-robot commented Nov 26, 2024

@machine424: This pull request references MON-3920 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

openshift-ci bot commented Nov 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: machine424

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 26, 2024
@machine424
Contributor Author

/jira refresh

@openshift-ci-robot

openshift-ci-robot commented Nov 26, 2024

@machine424: This pull request references MON-3920 which is a valid jira issue.

In response to this:

/jira refresh


Contributor

openshift-ci bot commented Nov 26, 2024

@machine424: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | b02e4a4 | link | true | `/test markdownlint` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@juzhao

juzhao commented Nov 27, 2024

The ci/prow/markdownlint job failed because lines exceed the line-length limit:

Summary: 7 error(s)
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:6:81 MD013/line-length Line length [Expected: 80; Actual: 94]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:35:81 MD013/line-length Line length [Expected: 80; Actual: 97]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:36:81 MD013/line-length Line length [Expected: 80; Actual: 91]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:37:98 MD009/no-trailing-spaces Trailing spaces [Expected: 0 or 2; Actual: 1]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:37:81 MD013/line-length Line length [Expected: 80; Actual: 98]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:40 MD040/fenced-code-language Fenced code blocks should have a language specified [Context: "```"]
alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md:48:81 MD013/line-length Line length [Expected: 80; Actual: 97]
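
Most of the MD013 findings above can be approximated locally before pushing. The sketch below is a rough stand-in, not markdownlint itself: `long_lines` is a hypothetical helper, its default of 80 mirrors the `Expected: 80` in the report, and it ignores the exemption markdownlint applies by default to lines that have no whitespace past the limit (such as a long URL on its own).

```shell
# long_lines FILE [LIMIT]
# Hypothetical helper: print "<line number>:<length>" for every line in FILE
# that is longer than LIMIT (default 80, the limit shown in the lint report).
long_lines() {
  awk -v limit="${2:-80}" 'length($0) > limit { print NR ":" length($0) }' "$1"
}

# Example (path taken from the lint report):
#   long_lines alerts/cluster-monitoring-operator/PrometheusKubernetesListWatchFailures.md
```

Lengths are counted the way the local `awk` counts them, which can differ from markdownlint for non-ASCII text, so treat the output as a hint rather than a verdict.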

```shell
$ NAMESPACE='<value of namespace label from alert>'

$ oc -n $NAMESPACE logs -c prometheus -l 'app.kubernetes.io/name=prometheus'

FYI, this command only outputs 20 lines of logs, so the error may not appear in the output. Maybe
$ oc -n $NAMESPACE logs -c prometheus ${prometheus_pod}
is better?

Contributor Author

Using the label is meant to get the logs from both pods without having to specify their names.

You aren't seeing any logs even though PrometheusKubernetesListWatchFailures is firing? There should be no logs when everything is fine.

@juzhao juzhao Nov 27, 2024

I meant that the command only outputs 20 lines of logs, so the error may not be among them.
The default log level for Prometheus is info, so there are many log lines before the error appears. Example: https://privatebin.corp.redhat.com/?545d07abffd73da8#HhcRUpPiLk5apApkEDsKkCCYcCjixMbUbx3tUpGHxUup
oc -n $NAMESPACE logs -c prometheus -l 'app.kubernetes.io/name=prometheus' --tail=-1 will show all logs.

```
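
The log-retrieval point raised in the thread above can be sketched as follows. This is an illustration under assumptions, not part of the runbook: `filter_listwatch_errors` is a hypothetical helper, and its pattern is a guess at how list/watch failures are worded in the Prometheus logs. `--tail=-1` comes from the discussion; `--prefix`, which labels each line with its source pod and container, is an addition.

```shell
# Hypothetical helper: keep only log lines that look like list/watch failures.
# The pattern is an assumption about the log wording; adjust it as needed.
filter_listwatch_errors() {
  # "|| true" keeps the exit status at 0 when nothing matches.
  grep -iE 'failed to (list|watch)' || true
}

# Usage against a live cluster (requires oc and a logged-in session):
#   NAMESPACE='<value of namespace label from alert>'
#   oc -n "$NAMESPACE" logs -c prometheus \
#     -l 'app.kubernetes.io/name=prometheus' --tail=-1 --prefix \
#     | filter_listwatch_errors
```

Fetching the full log and filtering afterwards keeps the label selector, so both pods are covered without naming them, while avoiding the truncated output that a selector produces by default.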

To rectify this issue, ensure Prometheus is granted the necessary RBAC permissions as detailed in
the following guide: [Configuring Prometheus to scrape metrics].
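
One way to check the RBAC side before working through the guide is `oc auth can-i` with impersonation. This is a sketch under assumptions: the service account `prometheus-k8s` in `openshift-monitoring`, the resource list, and the helper name `check_listwatch_rbac` are illustrative and not taken from the runbook. Impersonation also requires that the caller has sufficient privileges.

```shell
# Assumed defaults for platform Prometheus; verify them on your cluster.
SA_NAMESPACE="${SA_NAMESPACE:-openshift-monitoring}"
SA_NAME="${SA_NAME:-prometheus-k8s}"
OC="${OC:-oc}"   # set OC=echo to print the commands instead of running them

# check_listwatch_rbac TARGET_NAMESPACE
# Ask the API server whether the service account may list and watch the
# resources commonly used for service discovery (an assumed list).
check_listwatch_rbac() {
  target_ns="$1"
  subject="system:serviceaccount:${SA_NAMESPACE}:${SA_NAME}"
  for verb in list watch; do
    for resource in pods services endpoints; do
      $OC auth can-i "$verb" "$resource" -n "$target_ns" --as="$subject"
    done
  done
}

# Example:
#   check_listwatch_rbac '<namespace of the monitored workload>'
```

Each call prints `yes` or `no`; a `no` for any verb and resource pair points at a missing Role or RoleBinding in the target namespace.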
Contributor

Any reason to not just have the link here?

Contributor Author

With this, we can exceed markdownlint's line-length limit

Contributor Author

(otherwise, we cannot split the URL)

Contributor

Yeah, I get it. I'm not the biggest fan of working around it this way, but what can you do. I'll see if there is a lint exception. LGTM in the meantime.

Contributor Author

It's legitimate Markdown, though, and it's already used in other places (https://github.com/search?q=repo%3Aopenshift%2Frunbooks+%22%5D%3A+%22&type=code). For me, it's better than using an exception...

Contributor

Oh that actually gets rendered. TIL, sorry for the noise.

@machine424
Contributor Author

> ci/prow/markdownlint job failed for exceed line length
>
> Summary: 7 error(s)

Yep, I intended to fix that as part of the review suggestions, as I'm sure @eromanova97 has some ;)


## Mitigation

To gain further insight, review the logs of the affected Prometheus instance:
Contributor

Aren't we still in the diagnosis phase?

```

To rectify this issue, ensure Prometheus is granted the necessary RBAC permissions as detailed in
the following guide: [Configuring Prometheus to scrape metrics].
Contributor

IIUC, the permission issue can only happen for platform Prometheus, and it's either a misconfiguration from the certified operator (product bug) or a user-defined service/pod monitor deployed in a platform namespace (unsupported config). User-defined Prometheus should have full permissions by default.

The other cause could be a partial/complete outage of the Kubernetes API.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
5 participants