MON-3920: add runbook for PrometheusKubernetesListWatchFailures #225
base: master
@machine424: This pull request references MON-3920 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.19.0" version, but no target version was set. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: machine424 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/jira refresh
@machine424: This pull request references MON-3920 which is a valid jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@machine424: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
The ci/prow/markdownlint job failed because a line exceeds the line-length limit.
```shell
$ NAMESPACE='<value of namespace label from alert>'

$ oc -n $NAMESPACE logs -c prometheus -l 'app.kubernetes.io/name=prometheus'
```
FYI, this command only outputs 20 lines of logs, so the error may not appear in the output. Maybe
$ oc -n $NAMESPACE logs -c prometheus ${prometheus_pod}
is better?
Using the label is meant to get the logs from both pods without having to specify their names.
You aren't seeing any logs even though PrometheusKubernetesListWatchFailures
is firing? There should be no logs when everything is fine.
I meant that the command only outputs 20 lines of logs, and the error may not be among them.
The default log level for Prometheus is info, so there are many log lines before we see the error. Example: https://privatebin.corp.redhat.com/?545d07abffd73da8#HhcRUpPiLk5apApkEDsKkCCYcCjixMbUbx3tUpGHxUup
oc -n $NAMESPACE logs -c prometheus -l 'app.kubernetes.io/name=prometheus' --tail=-1
will show all logs.
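For reference, a minimal sketch of how the full-log variant from this thread could be wrapped so that the errors are easier to spot. The function names `prometheus_logs` and `list_watch_errors` are hypothetical, and the grep pattern is only an assumption about how list/watch failures are worded in the logs, so adjust it to what the instance actually prints. `--prefix` is added only to show which pod each line came from.

```shell
# Hypothetical helper: print the complete logs of the prometheus container
# for every pod matching the label, instead of only the most recent lines.
prometheus_logs() {
  namespace="$1"  # value of the namespace label from the alert
  oc -n "$namespace" logs -c prometheus \
    -l 'app.kubernetes.io/name=prometheus' --tail=-1 --prefix
}

# Hypothetical filter. Assumption: the relevant errors contain "forbidden"
# or "failed to list"/"failed to watch" (matched case-insensitively).
list_watch_errors() {
  grep -i -E 'forbidden|failed to (list|watch)'
}

# Usage:
#   prometheus_logs "$NAMESPACE" | list_watch_errors
```

Keeping the label selector still covers both pods, and `--tail=-1` addresses the concern that the error is cut off.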
To rectify this issue, ensure Prometheus is granted the necessary RBAC permissions as detailed in
the following guide: [Configuring Prometheus to scrape metrics].
Any reason to not just have the link here?
With this we can exceed markdownlint's line-length
(otherwise, we cannot split the URL)
Yeah, I get it. Not the biggest fan of working around it this way, but what can you do. I'll see if there is a lint exception. LGTM in the meantime.
It's legitimate Markdown, though, and it's already used in other places (https://github.com/search?q=repo%3Aopenshift%2Frunbooks+%22%5D%3A+%22&type=code). For me it's better than using an exception...
Oh that actually gets rendered. TIL, sorry for the noise.
Yep, I intended to fix that as part of the review suggestions, as I'm sure @eromanova97 has some ;)
## Mitigation

To gain further insight, review the logs of the affected Prometheus instance:
Aren't we still in the diagnosis phase?
To rectify this issue, ensure Prometheus is granted the necessary RBAC permissions as detailed in
the following guide: [Configuring Prometheus to scrape metrics].
IIUC the permission issue can only happen for platform Prometheus, and it's either a misconfiguration from the certified operator (product bug) or a user-defined service/pod monitor deployed in a platform namespace (unsupported config). User-defined Prometheus should have full permissions by default.
The other cause could be a partial/complete outage of the Kubernetes API.
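One way to tell these two causes apart, sketched under assumptions: ask the API server whether the Prometheus service account may list and watch the discovered resources in the affected namespace, and separately query the API server's readiness endpoint. The function names are hypothetical. The resource list (pods, services, endpoints) and the service account shown in the usage comment are assumptions about a typical platform setup, not something this PR states.

```shell
# Hypothetical helper: print whether the given service account may list and
# watch the resources that Kubernetes service discovery typically needs.
check_sd_permissions() {
  namespace="$1"        # value of the namespace label from the alert
  service_account="$2"  # full name, system:serviceaccount:<namespace>:<name>
  for resource in pods services endpoints; do  # assumed resource set
    for verb in list watch; do
      printf '%s %s: ' "$verb" "$resource"
      oc auth can-i "$verb" "$resource" -n "$namespace" --as="$service_account"
    done
  done
}

# Hypothetical helper: show the readiness checks of the Kubernetes API server
# to help rule out a partial or complete API outage.
check_api_health() {
  oc get --raw='/readyz?verbose'
}

# Usage (the service account name is an assumption for platform Prometheus):
#   check_sd_permissions "$NAMESPACE" \
#     system:serviceaccount:openshift-monitoring:prometheus-k8s
#   check_api_health
```

A "no" answer from the first check points at RBAC, while failing readiness checks point at the API itself.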
No description provided.