OCPBUGS-45924: add a monitor test to detect concurrent installer pod or static pod #29462

Merged
merged 2 commits into from
Jan 25, 2025

Conversation

@tkashem tkashem commented Jan 22, 2025

The monitor test works as follows:

  • a) parse the kubelet log for SyncLoop lines (both PLEG and probe lines)
  • b) generate an event/interval for each of the lines from (a)
  • c) add a monitor test that inspects the events from (b) and constructs/computes new intervals:
    • etcd installer pod duration (derived from the PLEG container start and exit events)
    • static pod (etcd for now) unready interval (derived from SyncLoop probe events)
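
Steps (a) and (b) can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `syncLoopEvent` type, the `parseSyncLoop` function, and both regular expressions are hypothetical, and the sample log lines only approximate the kubelet's structured output.

```go
package main

import (
	"fmt"
	"regexp"
)

// syncLoopEvent is a hypothetical, simplified stand-in for the interval
// the monitor would record for one kubelet log line.
type syncLoopEvent struct {
	Node   string
	Pod    string // "namespace/name" as printed by the kubelet
	Kind   string // "PLEG" or "probe"
	Detail string // PLEG event type, or "<probe>=<status>" for probe lines
}

// The patterns assume klog-style structured output; the exact field
// order on a real cluster is an assumption of this sketch.
var (
	plegRe  = regexp.MustCompile(`"SyncLoop \(PLEG\): event for pod" pod="([^"]+)" event=\{.*"Type":"([^"]+)"`)
	probeRe = regexp.MustCompile(`"SyncLoop \(probe\)" probe="([^"]+)" status="([^"]*)" pod="([^"]+)"`)
)

// parseSyncLoop returns the event found on the line and handled=true,
// or handled=false when the line is not a SyncLoop line.
func parseSyncLoop(node, line string) (syncLoopEvent, bool) {
	if m := plegRe.FindStringSubmatch(line); m != nil {
		return syncLoopEvent{Node: node, Pod: m[1], Kind: "PLEG", Detail: m[2]}, true
	}
	if m := probeRe.FindStringSubmatch(line); m != nil {
		return syncLoopEvent{Node: node, Pod: m[3], Kind: "probe", Detail: m[1] + "=" + m[2]}, true
	}
	return syncLoopEvent{}, false
}

func main() {
	// Illustrative lines only; timestamps and source locations are omitted.
	lines := []string{
		`kubelet.go] "SyncLoop (PLEG): event for pod" pod="openshift-etcd/installer-7-master-0" event={"ID":"abc","Type":"ContainerStarted","Data":"def"}`,
		`kubelet.go] "SyncLoop (probe)" probe="readiness" status="" pod="openshift-etcd/etcd-master-0"`,
		`kubelet.go] "some unrelated message"`,
	}
	for _, l := range lines {
		if ev, ok := parseSyncLoop("master-0", l); ok {
			fmt.Printf("%+v\n", ev)
		}
	}
}
```

Pairing a PLEG `ContainerStarted` event with the matching exit event, or an unready probe status with the next ready one, would then yield the constructed intervals described in (c).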

In this example, blue is the installer pod duration and red is the etcd static pod unready window:
image

from: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29462/pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade/1882095583997464576

The monitor test also flakes if:

  • it finds two concurrent installer pods running on separate nodes
  • it finds two concurrent unready windows for the etcd static pod on separate nodes, which could potentially lead to etcd quorum loss
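
The concurrency check in both bullets amounts to finding time windows that overlap while belonging to different nodes. A minimal sketch of that idea follows; the `window` type and the `at`, `overlaps`, and `concurrentAcrossNodes` helpers are hypothetical names for illustration and are not taken from this PR.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// window is a hypothetical stand-in for a constructed interval
// (installer pod duration or static pod unready window).
type window struct {
	Node     string
	From, To time.Time
}

// at builds a timestamp a given number of seconds after the Unix epoch.
func at(sec int) time.Time { return time.Unix(int64(sec), 0).UTC() }

// overlaps reports whether two windows share a span of time; windows
// that only touch at an endpoint are not treated as overlapping.
func overlaps(a, b window) bool {
	return a.From.Before(b.To) && b.From.Before(a.To)
}

// concurrentAcrossNodes returns every pair of windows that overlap in
// time and belong to different nodes.
func concurrentAcrossNodes(ws []window) [][2]window {
	sorted := append([]window(nil), ws...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].From.Before(sorted[j].From) })

	var pairs [][2]window
	for i := range sorted {
		for j := i + 1; j < len(sorted); j++ {
			if !sorted[j].From.Before(sorted[i].To) {
				break // sorted by start time: no later window can overlap sorted[i]
			}
			if sorted[i].Node != sorted[j].Node && overlaps(sorted[i], sorted[j]) {
				pairs = append(pairs, [2]window{sorted[i], sorted[j]})
			}
		}
	}
	return pairs
}

func main() {
	ws := []window{
		{Node: "master-0", From: at(0), To: at(60)},
		{Node: "master-1", From: at(30), To: at(90)},   // overlaps master-0
		{Node: "master-2", From: at(120), To: at(150)}, // no overlap
	}
	for _, p := range concurrentAcrossNodes(ws) {
		fmt.Printf("concurrent: %s and %s\n", p[0].Node, p[1].Node)
	}
}
```

Sorting by start time lets the inner loop stop early, which keeps the check cheap when most windows do not overlap.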

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 22, 2025
@openshift-ci openshift-ci bot requested review from deads2k and p0lyn0mial January 22, 2025 01:00
@tkashem
Copy link
Contributor Author

tkashem commented Jan 22, 2025

/retest

@tkashem tkashem force-pushed the mt-static-pod branch 2 times, most recently from b041550 to 2ea9f20 Compare January 22, 2025 15:57
@tkashem tkashem changed the title [WIP] monitor test to detect concurrent installer pod or static pod add a monitor test to detect concurrent installer pod or static pod Jan 22, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 22, 2025
@tkashem
Copy link
Contributor Author

tkashem commented Jan 23, 2025

/payload 4.19 nightly informing

Copy link
Contributor

openshift-ci bot commented Jan 23, 2025

@tkashem: trigger 67 job(s) of type informing for the nightly release of OCP 4.19

  • periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-compact-fips
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-ha-dualstack-conformance
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-single-node-ipv6
  • periodic-ci-openshift-release-master-nightly-4.19-console-aws
  • periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.19-periodics-e2e-aws
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-csi
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-cgroupsv2
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-fips
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-single-node
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-single-node-csi
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-single-node-serial
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-single-node-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-single-node-techpreview-serial
  • periodic-ci-openshift-release-master-nightly-4.19-upgrade-from-stable-4.18-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-upgrade-out-of-change
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-upi
  • periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.19-periodics-e2e-azure
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-azure-csi
  • periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn
  • periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn-techpreview
  • periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn-techpreview-serial
  • periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn-upgrade-out-of-change
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-driver-toolkit
  • periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.19-periodics-e2e-gcp
  • periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-csi
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-rt
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn-techpreview
  • periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn-techpreview-serial
  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-bm-upgrade
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-dualstack
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-dualstack-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-serial-ipv4
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-serial-virtualmedia
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-upgrade-from-stable-4.18-e2e-metal-ipi-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-serial-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-serial-ovn-dualstack
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-upgrade-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.19-upgrade-from-stable-4.18-e2e-metal-ipi-upgrade-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ovn-assisted
  • periodic-ci-openshift-release-master-nightly-4.19-metal-ovn-single-node-recert-cluster-rename
  • periodic-ci-openshift-osde2e-main-nightly-4.19-osd-aws
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-osd-ccs-gcp
  • periodic-ci-openshift-osde2e-main-nightly-4.19-osd-gcp
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-proxy
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ovn-single-node-live-iso
  • periodic-ci-openshift-osde2e-main-nightly-4.19-rosa-classic-sts
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-rosa-sts-hypershift-ovn
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-telco5g
  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-csi
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-serial
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-techpreview-serial
  • periodic-ci-openshift-release-master-ci-4.19-e2e-vsphere-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-vsphere-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-upi
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-upi-serial
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-static-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/381fdcf0-d92e-11ef-8b09-a0c863e1b4f6-0


func (mt *monitorTest) EvaluateTestsFromConstructedIntervals(ctx context.Context, finalIntervals monitorapi.Intervals) ([]*junitapi.JUnitTestCase, error) {
	junitTest := &junitTest{
		name: "[sig-apimachinery] installer Pods should not run concurrently on two or more node",
Contributor

It would be a little cleaner to lowercase "Pods" and pluralize "node".

}
for _, interval := range concurrent {
	failed.FailureOutput.Output = fmt.Sprintf("%s\n%s", failed.FailureOutput.Output, interval.String())
}
Contributor Author

@tkashem tkashem Jan 23, 2025

Thanks for the Sippy URL. Almost all of these flakes are from earlier runs with the buggy version of this PR; the newer runs from today do not have these flakes.

Contributor

Ah that makes sense, I followed the link to the last failure.

@@ -487,6 +495,9 @@ <h5 class="modal-title">Resource</h5>
timelineGroups.push({group: "api-unreachable", data: []})
createTimelineData(isAPIUnreachableFromClientValue, timelineGroups[timelineGroups.length - 1].data, eventIntervals, isAPIUnreachableFromClientActivity, regex)

timelineGroups.push({group: "staticpod-install", data: []})
createTimelineData(isStaticPodInstallMonitorValue, timelineGroups[timelineGroups.length - 1].data, eventIntervals, isStaticPodInstallMonitorActivity, regex)

Contributor

Are the solid blue bars that appear to overlap with a darker shade of blue what you would expect here? https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29462/pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade/1881869312688394240

It almost looks like the locators are overlapping, but is it also supposed to show overlap for all that time?

Contributor Author

No, this is from a run with an earlier version of this PR. I have fixed the issue; any run from today should not have the large blue bar.

@tkashem
Copy link
Contributor Author

tkashem commented Jan 23, 2025

/payload 4.19 nightly informing

Copy link
Contributor

openshift-ci bot commented Jan 23, 2025

@tkashem: trigger 67 job(s) of type informing for the nightly release of OCP 4.19

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b0796a60-d9e5-11ef-9bce-7b4be61e19fb-0

accummulated := monitorapi.Intervals{}
for _, parser := range p {
	intervals, handled := parser.Parse(node, line)
	accummulated = append(accummulated, intervals...)
Member

@ingvagabund ingvagabund Jan 24, 2025

What if (in the future) either of the Parse implementations accidentally returns non-nil intervals with handled=false? Will accummulated still be a valid list of intervals?

Contributor Author

We can tweak it in the future to forcefully run all parsers. We also want to streamline all the existing parsers:

ret = append(ret, readinessFailure(nodeName, currLine)...)
ret = append(ret, readinessError(nodeName, currLine)...)
ret = append(ret, statusHttpClientConnectionLostError(nodeName, currLine)...)
ret = append(ret, reflectorHttpClientConnectionLostError(nodeName, currLine)...)
ret = append(ret, kubeletNodeHttpClientConnectionLostError(nodeName, currLine)...)
ret = append(ret, startupProbeError(nodeName, currLine)...)
ret = append(ret, errParsingSignature(nodeName, currLine)...)
ret = append(ret, failedToDeleteCGroupsPath(nodeLocator, currLine)...)
ret = append(ret, anonymousCertConnectionError(nodeLocator, currLine)...)
ret = append(ret, leaseUpdateError(nodeLocator, currLine)...)
ret = append(ret, leaseFailBackOff(nodeLocator, currLine)...)

I had to move the two parsers in this PR into their own packages so I could write some tests. To your question: right now, if that happens, the existing tests will fail, so we have some protection :)
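
One way the concern raised in this thread could be made explicit is to have the loop reject intervals that arrive with handled=false. The sketch below is only an illustration of that idea under assumed names: `interval`, `lineParser`, `parserFunc`, `runAll`, and the two sample parsers are hypothetical and do not reproduce the PR's types.

```go
package main

import (
	"fmt"
	"strings"
)

// interval is a hypothetical stand-in for monitorapi.Interval.
type interval struct{ Node, Message string }

// lineParser mirrors the shape of the Parse call shown in the diff:
// it returns the intervals for a line and whether it handled the line.
type lineParser interface {
	Parse(node, line string) (intervals []interval, handled bool)
}

// parserFunc adapts a plain function to the lineParser interface.
type parserFunc func(node, line string) ([]interval, bool)

func (f parserFunc) Parse(node, line string) ([]interval, bool) { return f(node, line) }

// runAll runs every parser on the line. Intervals are kept only when the
// parser reports handled=true; intervals returned with handled=false are
// treated as a contract violation and surfaced as an error.
func runAll(parsers []lineParser, node, line string) ([]interval, error) {
	var out []interval
	for i, p := range parsers {
		got, handled := p.Parse(node, line)
		if !handled {
			if len(got) > 0 {
				return out, fmt.Errorf("parser %d returned %d interval(s) with handled=false", i, len(got))
			}
			continue
		}
		out = append(out, got...)
	}
	return out, nil
}

// plegParser handles only lines that mention the PLEG SyncLoop message.
var plegParser = parserFunc(func(node, line string) ([]interval, bool) {
	if !strings.Contains(line, "SyncLoop (PLEG)") {
		return nil, false
	}
	return []interval{{Node: node, Message: line}}, true
})

// buggyParser deliberately breaks the contract to show the guard firing.
var buggyParser = parserFunc(func(node, line string) ([]interval, bool) {
	return []interval{{Node: node, Message: "stray"}}, false
})

func main() {
	got, err := runAll([]lineParser{plegParser}, "master-0", `"SyncLoop (PLEG): event for pod"`)
	fmt.Println(len(got), err)

	_, err = runAll([]lineParser{plegParser, buggyParser}, "master-0", "unrelated")
	fmt.Println(err)
}
```

Whether a mismatch should be an error, a logged warning, or silently dropped is a design choice; the unit tests mentioned in the reply already cover the current parsers.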

@dgoodwin
Copy link
Contributor

/approve

Feel free to get someone on your team to lgtm, the core things we look for are all good in here.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 24, 2025
@ingvagabund
Copy link
Member

This was a good learning opportunity. This is a great and very helpful start. Let's have this sit around for a while to observe a new signal in the CI jobs. Thank you Abu.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 24, 2025
Copy link
Contributor

openshift-ci bot commented Jan 24, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, ingvagabund, tkashem

@ingvagabund
Copy link
Member

ingvagabund commented Jan 24, 2025

CI improvements, the new test will only flake in the worst case.
/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Jan 24, 2025
@ingvagabund
Copy link
Member

/retitle no-jira: add a monitor test to detect concurrent installer pod or static pod

@openshift-ci openshift-ci bot changed the title add a monitor test to detect concurrent installer pod or static pod no-jira: add a monitor test to detect concurrent installer pod or static pod Jan 24, 2025
@openshift-ci-robot
Copy link

@tkashem: This pull request explicitly references no jira issue.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 24, 2025
Comment on lines +58 to +64
// the following constraints define pass/fail for this test:
// a) if we don't find any constructed/computed interval, then
// this test is a noop, so we mark the test as skipped
// b) we find constructed/computed intervals, but no occurrences of
// concurrent situation, this test is a pass
// c) otherwise, there is at least one incident of a
// concurrent situation, this test is a flake/fail
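
The three constraints in the comment above can be sketched as a small decision function. The `outcome` type and the `evaluate` function are hypothetical names used only for illustration; the PR expresses the result as JUnit test cases instead.

```go
package main

import "fmt"

// outcome is a hypothetical label for the result of the monitor test.
type outcome string

const (
	skipped outcome = "skipped"
	passed  outcome = "passed"
	flaked  outcome = "flaked"
)

// evaluate applies the constraints from the comment:
//
//	a) no constructed/computed intervals: the test is a noop, so skip
//	b) intervals exist but none are concurrent: pass
//	c) at least one concurrent incident: flake/fail
func evaluate(constructed, concurrent int) outcome {
	switch {
	case constructed == 0:
		return skipped
	case concurrent == 0:
		return passed
	default:
		return flaked
	}
}

func main() {
	fmt.Println(evaluate(0, 0)) // no intervals were constructed
	fmt.Println(evaluate(4, 0)) // intervals found, none concurrent
	fmt.Println(evaluate(4, 1)) // one concurrent incident
}
```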
Contributor

How hard would it be to make the test fail when a logging change causes either parser to observe nothing? If we don't see a single PLEG or probe log for any container (not limited to installer pods), would that be a reliable signal that the logs we're looking for have changed somehow?

Contributor Author

We want to streamline this for the other parsers as well; we will do a follow-up PR for this.

Contributor

(I know this is common to all the "log grepping" tests. I'd be happy to see an issue that describes the problem instead of delaying this test, which is useful immediately.)

Contributor Author

@ingvagabund
Copy link
Member

/hold
In case others wanna review too.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 24, 2025
@benluddy
Copy link
Contributor

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 24, 2025
@benluddy
Copy link
Contributor

/cherry-pick release-4.18

@openshift-cherrypick-robot

@benluddy: once the present PR merges, I will cherry-pick it on top of release-4.18 in a new PR and assign it to you.

@tkashem
Copy link
Contributor Author

tkashem commented Jan 24, 2025

/retest-required

1 similar comment
@tkashem
Copy link
Contributor Author

tkashem commented Jan 24, 2025

/retest-required

@benluddy
Copy link
Contributor

/shrug

@openshift-ci openshift-ci bot added the ¯\_(ツ)_/¯ ¯\\\_(ツ)_/¯ label Jan 24, 2025
@benluddy
Copy link
Contributor

/test e2e-gcp-ovn-rt-upgrade
/test e2e-metal-ipi-ovn
/test e2e-openstack-ovn
/test e2e-aws-ovn-kube-apiserver-rollout

@tkashem
Copy link
Contributor Author

tkashem commented Jan 24, 2025

/test

Copy link
Contributor

openshift-ci bot commented Jan 24, 2025

@tkashem: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

/test e2e-aws-jenkins
/test e2e-aws-ovn-edge-zones
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-image-registry
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial
/test e2e-gcp-ovn
/test e2e-gcp-ovn-builds
/test e2e-gcp-ovn-image-ecosystem
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test images
/test lint
/test unit
/test verify
/test verify-deps

The following commands are available to trigger optional jobs:

/test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback
/test e2e-agnostic-ovn-cmd
/test e2e-aws
/test e2e-aws-csi
/test e2e-aws-disruptive
/test e2e-aws-etcd-certrotation
/test e2e-aws-etcd-recovery
/test e2e-aws-ovn
/test e2e-aws-ovn-cgroupsv2
/test e2e-aws-ovn-etcd-scaling
/test e2e-aws-ovn-ipsec-serial
/test e2e-aws-ovn-kube-apiserver-rollout
/test e2e-aws-ovn-kubevirt
/test e2e-aws-ovn-single-node
/test e2e-aws-ovn-single-node-serial
/test e2e-aws-ovn-single-node-techpreview
/test e2e-aws-ovn-single-node-techpreview-serial
/test e2e-aws-ovn-single-node-upgrade
/test e2e-aws-ovn-upgrade
/test e2e-aws-ovn-upgrade-rollback
/test e2e-aws-ovn-upi
/test e2e-aws-ovn-virt-techpreview
/test e2e-aws-proxy
/test e2e-azure
/test e2e-azure-ovn-etcd-scaling
/test e2e-azure-ovn-upgrade
/test e2e-baremetalds-kubevirt
/test e2e-external-aws
/test e2e-external-aws-ccm
/test e2e-external-vsphere-ccm
/test e2e-gcp-csi
/test e2e-gcp-disruptive
/test e2e-gcp-fips-serial
/test e2e-gcp-ovn-etcd-scaling
/test e2e-gcp-ovn-rt-upgrade
/test e2e-gcp-ovn-techpreview
/test e2e-gcp-ovn-techpreview-serial
/test e2e-hypershift-conformance
/test e2e-metal-ipi-ovn
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-dualstack-local-gateway
/test e2e-metal-ipi-ovn-kube-apiserver-rollout
/test e2e-metal-ipi-serial
/test e2e-metal-ipi-serial-ovn-ipv6
/test e2e-metal-ipi-virtualmedia
/test e2e-metal-ovn-single-node-live-iso
/test e2e-metal-ovn-single-node-with-worker-live-iso
/test e2e-openstack-ovn
/test e2e-openstack-serial
/test e2e-vsphere
/test e2e-vsphere-ovn-dualstack-primaryv6
/test e2e-vsphere-ovn-etcd-scaling
/test okd-e2e-gcp
/test okd-scos-e2e-aws-ovn
/test okd-scos-images

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd
pull-ci-openshift-origin-master-e2e-aws-csi
pull-ci-openshift-origin-master-e2e-aws-ovn-cgroupsv2
pull-ci-openshift-origin-master-e2e-aws-ovn-edge-zones
pull-ci-openshift-origin-master-e2e-aws-ovn-fips
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout
pull-ci-openshift-origin-master-e2e-aws-ovn-microshift
pull-ci-openshift-origin-master-e2e-aws-ovn-microshift-serial
pull-ci-openshift-origin-master-e2e-aws-ovn-serial
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade
pull-ci-openshift-origin-master-e2e-gcp-csi
pull-ci-openshift-origin-master-e2e-gcp-ovn
pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade
pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade
pull-ci-openshift-origin-master-e2e-hypershift-conformance
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout
pull-ci-openshift-origin-master-e2e-openstack-ovn
pull-ci-openshift-origin-master-images
pull-ci-openshift-origin-master-lint
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn
pull-ci-openshift-origin-master-unit
pull-ci-openshift-origin-master-verify
pull-ci-openshift-origin-master-verify-deps

@tkashem
Copy link
Contributor Author

tkashem commented Jan 24, 2025

/retest

@tkashem
Copy link
Contributor Author

tkashem commented Jan 24, 2025

/retest-required

@tkashem
Copy link
Contributor Author

tkashem commented Jan 24, 2025

/test all

@tkashem
Copy link
Contributor Author

tkashem commented Jan 25, 2025

/retest-required

Copy link

openshift-trt bot commented Jan 25, 2025

Job Failure Risk Analysis for sha: 3eece60

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout Low
[Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io]
This test has passed 14.29% of 7 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.

Copy link
Contributor

openshift-ci bot commented Jan 25, 2025

@tkashem: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-gcp-ovn-rt-upgrade | 3eece60 | link | false | /test e2e-gcp-ovn-rt-upgrade
ci/prow/e2e-aws-ovn-single-node-upgrade | 3eece60 | link | false | /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-metal-ipi-ovn | 3eece60 | link | false | /test e2e-metal-ipi-ovn
ci/prow/e2e-aws-ovn-kube-apiserver-rollout | 3eece60 | link | false | /test e2e-aws-ovn-kube-apiserver-rollout
ci/prow/okd-scos-e2e-aws-ovn | 3eece60 | link | false | /test okd-scos-e2e-aws-ovn

@openshift-merge-bot openshift-merge-bot bot merged commit 2909253 into openshift:master Jan 25, 2025
24 of 29 checks passed
@openshift-cherrypick-robot

@benluddy: new pull request created: #29480

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-tests
This PR has been included in build openshift-enterprise-tests-container-v4.19.0-202501251809.p0.g2909253.assembly.stream.el9.
All builds following this will include this PR.

@tkashem tkashem changed the title no-jira: add a monitor test to detect concurrent installer pod or static pod OCPBUGS-45924: add a monitor test to detect concurrent installer pod or static pod Jan 27, 2025
@openshift-ci-robot
Copy link

@tkashem: Jira Issue OCPBUGS-45924 is in an unrecognized state (ON_QA) and will not be moved to the MODIFIED state.

@tkashem
Copy link
Contributor Author

tkashem commented Jan 27, 2025

/cherry-pick release-4.17

@openshift-cherrypick-robot

@tkashem: #29462 failed to apply on top of branch "release-4.17":

Applying: monitor static pod install by parsing kubelet logs
Using index info to reconstruct a base tree...
M	pkg/monitor/monitorapi/construction.go
M	pkg/monitor/monitorapi/types.go
M	pkg/monitortests/node/kubeletlogcollector/node.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/monitortests/node/kubeletlogcollector/node.go
CONFLICT (content): Merge conflict in pkg/monitortests/node/kubeletlogcollector/node.go
Auto-merging pkg/monitor/monitorapi/types.go
CONFLICT (content): Merge conflict in pkg/monitor/monitorapi/types.go
Auto-merging pkg/monitor/monitorapi/construction.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 monitor static pod install by parsing kubelet logs

Labels
acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. ¯\_(ツ)_/¯ ¯\\\_(ツ)_/¯