OCPBUGS-45924: add a monitor test to detect concurrent installer pod or static pod #29462
Conversation
/retest
Force-pushed from b041550 to 2ea9f20
/payload 4.19 nightly informing
@tkashem: trigger 67 job(s) of type informing for the nightly release of OCP 4.19
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/381fdcf0-d92e-11ef-8b09-a0c863e1b4f6-0
func (mt *monitorTest) EvaluateTestsFromConstructedIntervals(ctx context.Context, finalIntervals monitorapi.Intervals) ([]*junitapi.JUnitTestCase, error) {
	junitTest := &junitTest{
		name: "[sig-apimachinery] installer Pods should not run concurrently on two or more node",
It would be a little cleaner to lowercase "Pods" and pluralize "node".
}
for _, interval := range concurrent {
	failed.FailureOutput.Output = fmt.Sprintf("%s\n%s", failed.FailureOutput.Output, interval.String())
}
Link to help find flakes, though it will need adjusting if you change the test name: https://sippy.dptools.openshift.org/sippy-ng/tests/Presubmits/analysis?test=%5Bsig-apimachinery%5D%20installer%20Pods%20should%20not%20run%20concurrently%20on%20two%20or%20more%20node&filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22%5Bsig-apimachinery%5D%20installer%20Pods%20should%20not%20run%20concurrently%20on%20two%20or%20more%20node%22%7D%2C%7B%22columnField%22%3A%22variants%22%2C%22not%22%3Atrue%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22never-stable%22%7D%2C%7B%22columnField%22%3A%22variants%22%2C%22not%22%3Atrue%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22aggregated%22%7D%5D%2C%22linkOperator%22%3A%22and%22%7D
It looks like a very high flake rate.
Thanks for the sippy URL. Almost all of these flakes are from earlier runs with the buggy version of this PR; the newer runs from today do not have these flakes.
Ah, that makes sense. I followed the link to the last failure.
@@ -487,6 +495,9 @@ <h5 class="modal-title">Resource</h5>
timelineGroups.push({group: "api-unreachable", data: []})
createTimelineData(isAPIUnreachableFromClientValue, timelineGroups[timelineGroups.length - 1].data, eventIntervals, isAPIUnreachableFromClientActivity, regex)

timelineGroups.push({group: "staticpod-install", data: []})
createTimelineData(isStaticPodInstallMonitorValue, timelineGroups[timelineGroups.length - 1].data, eventIntervals, isStaticPodInstallMonitorActivity, regex)
Are the solid blue bars that appear to overlap with a darker shade of blue what you would expect here: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29462/pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade/1881869312688394240
It almost looks like the locators are overlapping, but also, is it supposed to show overlap for all that time?
No, this is from a run with an earlier version of this PR. I have fixed the issue; any run from today should not have the large blue bar.
Force-pushed from 2ea9f20 to 3eece60
/payload 4.19 nightly informing
@tkashem: trigger 67 job(s) of type informing for the nightly release of OCP 4.19
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b0796a60-d9e5-11ef-9bce-7b4be61e19fb-0
accummulated := monitorapi.Intervals{}
for _, parser := range p {
	intervals, handled := parser.Parse(node, line)
	accummulated = append(accummulated, intervals...)
What if (in the future) either of the Parse implementations accidentally returns non-nil intervals and handled=false? Will accummulated still be a valid list of intervals?
We can tweak it in the future to forcefully run all parsers; we also want to streamline all the existing parsers:
origin/pkg/monitortests/node/kubeletlogcollector/node.go
Lines 96 to 106 in 5b828df
ret = append(ret, readinessFailure(nodeName, currLine)...)
ret = append(ret, readinessError(nodeName, currLine)...)
ret = append(ret, statusHttpClientConnectionLostError(nodeName, currLine)...)
ret = append(ret, reflectorHttpClientConnectionLostError(nodeName, currLine)...)
ret = append(ret, kubeletNodeHttpClientConnectionLostError(nodeName, currLine)...)
ret = append(ret, startupProbeError(nodeName, currLine)...)
ret = append(ret, errParsingSignature(nodeName, currLine)...)
ret = append(ret, failedToDeleteCGroupsPath(nodeLocator, currLine)...)
ret = append(ret, anonymousCertConnectionError(nodeLocator, currLine)...)
ret = append(ret, leaseUpdateError(nodeLocator, currLine)...)
ret = append(ret, leaseFailBackOff(nodeLocator, currLine)...)
I had to move the two parsers in this PR into their own packages so I could write some tests. To your question: right now, if that happens, the existing tests will fail, so we have some protection :)
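A minimal sketch of the kind of guard a follow-up could add if we ever want to enforce the handled contract explicitly (the parser interface and names below are illustrative stand-ins, not the exact types in this PR):

```go
package main

import (
	"fmt"
	"strings"
)

// interval stands in for monitorapi.Interval; the real type carries locators,
// messages, and from/to timestamps.
type interval struct{ message string }

// lineParser mirrors the shape used in this PR: Parse returns any intervals
// derived from the log line plus a flag saying whether the line was handled.
type lineParser interface {
	Parse(node, line string) ([]interval, bool)
}

// probeParser is a toy parser that only handles lines mentioning "probe".
type probeParser struct{}

func (probeParser) Parse(node, line string) ([]interval, bool) {
	if !strings.Contains(line, "probe") {
		return nil, false
	}
	return []interval{{message: "probe event on " + node}}, true
}

// parseLine accumulates intervals from all parsers but only trusts output from
// parsers that report the line as handled, guarding against a future
// implementation that returns non-nil intervals together with handled=false.
func parseLine(parsers []lineParser, node, line string) []interval {
	accumulated := []interval{}
	for _, p := range parsers {
		intervals, handled := p.Parse(node, line)
		if !handled {
			continue // drop intervals from a parser that did not handle the line
		}
		accumulated = append(accumulated, intervals...)
	}
	return accumulated
}

func main() {
	out := parseLine([]lineParser{probeParser{}}, "node-0", "kubelet probe failed")
	fmt.Println(len(out)) // 1
}
```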
/approve Feel free to get someone on your team to lgtm; the core things we look for are all good in here.
This was a good learning opportunity. This is a great and very helpful start. Let's have this sit around for a while to observe a new signal in the CI jobs. Thank you, Abu. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: dgoodwin, ingvagabund, tkashem. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
CI improvements; the new test will only flake in the worst case.
/retitle no-jira: add a monitor test to detect concurrent installer pod or static pod
@tkashem: This pull request explicitly references no jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
// the following constraints define pass/fail for this test:
// a) if we don't find any constructed/computed interval, then
//    this test is a noop, so we mark the test as skipped
// b) we find constructed/computed intervals, but no occurrences of
//    concurrent situation, this test is a pass
// c) otherwise, there is at least one incident of a
//    concurrent situation, this test is a flake/fail
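For readers skimming the review, a rough sketch of how these three constraints can map onto JUnit results, using the common monitor-test convention that a pass emitted alongside a fail of the same name counts as a flake (the types below are simplified stand-ins for the junitapi ones):

```go
package main

import "fmt"

// Minimal stand-ins for the junitapi types a monitor test returns; the real
// ones in origin carry more fields.
type failureOutput struct{ Output string }
type skipMessage struct{ Message string }
type junitTestCase struct {
	Name          string
	FailureOutput *failureOutput
	SkipMessage   *skipMessage
}

const testName = "[sig-apimachinery] installer pods should not run concurrently on two or more nodes"

// evaluate maps the three constraints onto JUnit results:
//   a) no constructed intervals        -> a single skipped case
//   b) intervals but no concurrency    -> a single passing case
//   c) at least one concurrent window  -> fail + pass with the same name (flake)
func evaluate(constructed, concurrent []string) []*junitTestCase {
	if len(constructed) == 0 {
		return []*junitTestCase{{Name: testName, SkipMessage: &skipMessage{Message: "no installer/static pod intervals observed"}}}
	}
	if len(concurrent) == 0 {
		return []*junitTestCase{{Name: testName}}
	}
	failed := &junitTestCase{Name: testName, FailureOutput: &failureOutput{}}
	for _, incident := range concurrent {
		failed.FailureOutput.Output += incident + "\n"
	}
	// Emitting a passing case with the same name alongside the failure is the
	// usual way a monitor test reports a flake instead of a hard failure.
	return []*junitTestCase{failed, {Name: testName}}
}

func main() {
	fmt.Println(len(evaluate([]string{"installer-10 on master-0"}, nil))) // 1 -> pass
}
```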
How hard would it be to make the test fail when a logging change causes either parser to observe nothing? If we don't see a single PLEG or probe log for any container (not limited to installer pods), would that be a reliable signal that the logs we're looking for have changed somehow?
We want to streamline this for other parsers as well; we will do a follow-up PR for this.
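As a rough illustration only, a guard like the following could be part of that follow-up (the names and stats struct are hypothetical, not code from this PR):

```go
package main

import "fmt"

// parserStats is a hypothetical aggregate of what the two parsers saw over a
// whole run; a real implementation would fill it in while parsing node logs.
type parserStats struct {
	plegEvents  int
	probeEvents int
}

// checkParsersObservedSomething sketches the guard discussed above: if a run
// produces zero PLEG and zero probe observations across all containers, that
// likely means the kubelet log format changed and the parsers have gone blind.
func checkParsersObservedSomething(stats parserStats) error {
	if stats.plegEvents == 0 && stats.probeEvents == 0 {
		return fmt.Errorf("no PLEG or probe log lines were parsed; the kubelet log format may have changed")
	}
	return nil
}

func main() {
	if err := checkParsersObservedSomething(parserStats{}); err != nil {
		fmt.Println("would flag:", err)
	}
}
```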
(I know this is common to all the "log grepping" tests. I'd be happy to see an issue that describes the problem instead of delaying this test, which is useful immediately.)
/hold
/hold cancel
/cherry-pick release-4.18
@benluddy: once the present PR merges, I will cherry-pick it on top of release-4.18. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/retest-required
/retest-required
/shrug
/test e2e-gcp-ovn-rt-upgrade
/test
@tkashem: The
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/retest
/retest-required
/test all
/retest-required
Job Failure Risk Analysis for sha: 3eece60
@tkashem: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Merged commit 2909253 into openshift:master
@benluddy: new pull request created: #29480. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[ART PR BUILD NOTIFIER] Distgit: openshift-enterprise-tests
@tkashem: Jira Issue OCPBUGS-45924 is in an unrecognized state (ON_QA) and will not be moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/cherry-pick release-4.17
@tkashem: #29462 failed to apply on top of branch "release-4.17":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
The monitor test works as follows:
a
b
and construct/compute new intervals. This is an example: blue is the installer pod duration, red is the etcd static pod unready window:
from: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29462/pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade/1882095583997464576
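As context for the example above, a minimal sketch of the kind of overlap check the test performs on the constructed intervals (hypothetical types and names, not the PR's exact code): two windows on different nodes are flagged when their time ranges intersect.

```go
package main

import (
	"fmt"
	"time"
)

// podInterval is a stand-in for a constructed interval: the window during
// which an installer pod (or a static pod revision) ran on a node.
type podInterval struct {
	node     string
	from, to time.Time
}

// overlaps reports whether two time windows intersect.
func overlaps(a, b podInterval) bool {
	return a.from.Before(b.to) && b.from.Before(a.to)
}

// concurrentOnDifferentNodes returns every pair of intervals that overlap in
// time while belonging to two different nodes, which is the condition the
// monitor test flags.
func concurrentOnDifferentNodes(intervals []podInterval) [][2]podInterval {
	var pairs [][2]podInterval
	for i := 0; i < len(intervals); i++ {
		for j := i + 1; j < len(intervals); j++ {
			if intervals[i].node != intervals[j].node && overlaps(intervals[i], intervals[j]) {
				pairs = append(pairs, [2]podInterval{intervals[i], intervals[j]})
			}
		}
	}
	return pairs
}

func main() {
	t0 := time.Now()
	got := concurrentOnDifferentNodes([]podInterval{
		{node: "master-0", from: t0, to: t0.Add(2 * time.Minute)},
		{node: "master-1", from: t0.Add(time.Minute), to: t0.Add(3 * time.Minute)},
	})
	fmt.Printf("found %d concurrent pair(s)\n", len(got))
}
```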
The monitor test also flakes if: