cmd/openshift-install/create: Retry watch connections #606
Conversation
Force-pushed from 8e1bcbf to d2a110f
We might be seeing this when bootkube takes the server out of service on the bootstrap node...
Good link, and that suggests lowering the NLB timeouts to address this issue. Can we do that easily, @crawford?
Assorted references:
/test e2e-aws
Exercising openshift/release#2062
@wking we should also tune our watch client to use very small dial timeouts.
The NLBs are configured here. I believe the reason for the long timeout is so that clients don't get disconnected while waiting for events.
Yeah, the long idle timeout should be fine. But we may want smaller target timeouts.
Sounds like the load-balancer keepalive means dial timeouts don't help. As a temporary hack, how about a 5-minute timeout on the reconnect attempt seeing
I don't see it discussed in kubernetes/client-go#374 or kubernetes/kubernetes#65012, but we can probably protect against being load-balanced to a dead master (like the bootstrap node after bootkube exits) by setting
Or maybe not? Looks like the default is 10 seconds. Maybe we're recycling an existing HTTP/2 connection.
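For illustration, here is roughly how a short dial timeout could be wired into a client-go rest.Config. This is a sketch under assumptions, not the installer's code; the kubeconfig path, the 5-second dial timeout, and the 30-second keep-alive are invented values:

```go
package main

import (
	"net"
	"time"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// configWithShortDial builds a rest.Config whose *new* connections fail fast
// if the load balancer routes us to a dead backend. Values are illustrative.
func configWithShortDial(kubeconfigPath string) (*rest.Config, error) {
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		return nil, err
	}
	dialer := &net.Dialer{
		Timeout:   5 * time.Second,  // dial timeout; only applies when a new connection is opened
		KeepAlive: 30 * time.Second, // TCP keep-alive, distinct from the NLB idle timeout
	}
	// Recycled HTTP/2 connections never re-dial, so this does not help if we
	// keep reusing an existing connection to a now-dead bootstrap backend.
	config.Dial = dialer.DialContext
	return config, nil
}
```

As the last comment above notes, this only matters when a fresh dial actually happens; a reused HTTP/2 connection bypasses the dialer entirely.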
Force-pushed from d2a110f to 8819de4
Force-pushed from 25e8209 to c32c6e6
Moving in this direction, I've pushed d2a110f -> c32c6e6, replacing my initial 5-second delay (which was not working) with a call to
Success. Try again to check stability: /test e2e-aws
Success again. One more try: /test e2e-aws
Maybe too many tests in the same namespace? Once more to see: /test e2e-aws
So back to the drawing board :/
Force-pushed from c32c6e6 to e9907b4
I've pushed c32c6e6 -> e9907b4, adding a belt-and-suspenders
@wking https://godoc.org/k8s.io/client-go/rest#Config has
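For reference, rest.Config also exposes a client-wide Timeout field; whether that is the field being pointed at in the comment above is an assumption on my part. A minimal sketch with an illustrative value:

```go
package main

import (
	"time"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// configWithClientTimeout sets rest.Config.Timeout, which bounds each HTTP
// request made through clients built from this config. Note that a blunt
// per-request timeout would also cut off healthy long-running watches, so it
// is not an obvious fit for the bootstrap-complete watch on its own.
func configWithClientTimeout(kubeconfigPath string) (*rest.Config, error) {
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		return nil, err
	}
	config.Timeout = 30 * time.Second // illustrative value, not from this PR
	return config, nil
}
```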
/retest
I'm trying to get a TCP dump to get a better handle on whether we're reconnecting or not.
Actually, I didn't mean /test e2e-aws
/retest
Force-pushed from e9907b4 to 406da21
/retest
/retest
We usually lose our watch when bootkube goes down. From job [1]:

    $ oc project ci-op-9g1vhqtz
    $ oc logs -f --timestamps e2e-aws -c setup | tee /tmp/setup.log
    ...
    2018-11-03T04:19:55.121757935Z level=debug msg="added openshift-master-controllers.1563825412a4d77b: controller-manager-b5v49 became leader"
    ...
    2018-11-03T04:20:14.679215171Z level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3069"
    2018-11-03T04:20:16.539967372Z level=debug msg="added bootstrap-complete: cluster bootstrapping has completed"
    2018-11-03T04:20:16.540030121Z level=info msg="Destroying the bootstrap resources..."
    ...

And simultaneously:

    $ ssh -i libra.pem [email protected] journalctl -f | tee /tmp/bootstrap.log
    ...
    Nov 03 04:20:14 ip-10-0-10-86 bootkube.sh[1033]: All self-hosted control plane components successfully started
    Nov 03 04:20:14 ip-10-0-10-86 bootkube.sh[1033]: Tearing down temporary bootstrap control plane...
    Nov 03 04:20:15 ip-10-0-10-86 hyperkube[840]: E1103 04:20:15.968877 840 kuberuntime_container.go:65] Can't make a ref to pod "bootstrap-cluster-version-operator-ip-10-0-10-86_openshift-cluster-version(99ccfef8309f84bf88a0ca4a277097ac)", container cluster-version-operator: selfLink was empty, can't make reference
    Nov 03 04:20:15 ip-10-0-10-86 hyperkube[840]: E1103 04:20:15.975624 840 kuberuntime_container.go:65] Can't make a ref to pod "bootstrap-kube-apiserver-ip-10-0-10-86_kube-system(427a4a342e137b5a9bb39a0feff24625)", container kube-apiserver: selfLink was empty, can't make reference
    Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: W1103 04:20:16.002990 840 pod_container_deletor.go:75] Container "0c647bcd6317ac2e06b625c44151aa6a3487aa1c47c5f1468213756f9a48ef91" not found in pod's containers
    Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: W1103 04:20:16.005146 840 pod_container_deletor.go:75] Container "bc47233151f1c0afaaee9e7abcfec9a515fe0a720ed2251fd7a51602c59060c5" not found in pod's containers
    Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: progress.service holdoff time over, scheduling restart.
    Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: Starting Report the completion of the cluster bootstrap process...
    Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: Started Report the completion of the cluster bootstrap process.
    Nov 03 04:20:16 ip-10-0-10-86 report-progress.sh[6828]: Reporting install progress...
    Nov 03 04:20:16 ip-10-0-10-86 report-progress.sh[6828]: event/bootstrap-complete created
    Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: I1103 04:20:16.719526 840 reconciler.go:181] operationExecutor.UnmountVolume started for volume "secrets" (UniqueName: "kubernetes.io/host-path/427a4a342e137b5a9bb39a0feff24625-secrets") pod "427a4a342e137b5a9bb39a0feff24625" (UID: "427a4a342e137b5a9bb39a0feff24625")
    ...

The resourceVersion watch will usually allow the re-created watcher to pick up where its predecessor left off [2]. But in at least some cases, that doesn't seem to be happening [3]:

    2018/11/02 23:30:00 Running pod e2e-aws
    2018/11/02 23:48:00 Container test in pod e2e-aws completed successfully
    2018/11/02 23:51:52 Container teardown in pod e2e-aws completed successfully
    2018/11/03 00:08:33 Copying artifacts from e2e-aws into /logs/artifacts/e2e-aws
    level=debug msg="Fetching \"Terraform Variables\"..."
    ...
    level=debug msg="added openshift-master-controllers.1563734e367132e0: controller-manager-xlw62 became leader"
    level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3288"
    level=fatal msg="Error executing openshift-install: waiting for bootstrap-complete: timed out waiting for the condition"
    2018/11/03 00:08:33 Container setup in pod e2e-aws failed, exit code 1, reason Error

That's unfortunately missing timestamps for the setup logs, but in the successful logs from ci-op-9g1vhqtz, you can see the controller-manager becoming a leader around 20 seconds before bootstrap-complete. That means the bootstrap-complete event probably fired around when the watcher dropped, which was probably well before the setup container timed out (~38 minutes after it was launched). The setup container timing out was probably the watch re-connect event hitting the 30-minute eventContext timeout.

The issue seems to be that when watcherFunc returned an error, doReceive returned true and receive exited, closing the doneChan (but *not* the resultChan). That left UntilWithoutRetry hung waiting for a result that never appeared (until the context timed it out). With this commit, I'm retrying the watch connection every two seconds until we get a successful reconnect (or the context times out); a rough sketch of that shape follows after the references below. And I'm dropping the unused doneChan and closing resultChan instead to fix the hang mentioned in the previous paragraph.

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/595/pull-ci-openshift-installer-master-e2e-aws/1173
[2]: https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#concurrency-control-and-consistency
[3]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/45/pull-ci-openshift-cluster-version-operator-master-e2e-aws/58/build-log.txt
Force-pushed from 406da21 to 109531f
I've pushed 406da21 -> 109531f, which I think fixes the issue here (I'd misunderstood
Success in under 30 minutes:
Reproducible? /test e2e-aws
Successful reconnect from the
And from the build log:
Once more: /test e2e-aws
Another successful reconnect, this one handling some re-watch connection issues gracefully:
I'm satisfied, although I'd like to give this some cook time after landing before we revert #615 ;).
The retry loop to create the watcher is :| but I think we can iterate on it. /lgtm :yay:
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: abhinavdahiya, wking
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
I'm not positive this is related, but around the time this merged we got a string of CI failures in the release payload creation job that look like we aren't seeing the cluster stabilize before the e2e suite launches (some pods are crash-looping). Those could of course be preexisting failures in the operators, but we need to dig into whether we were masking the crashing before this merged.
This PR may cause the
However, that router detection ends in a somewhat concerning:
That "successfully rolled out" message looks like it comes straight from |
This reverts commit 6dc1bf6, openshift#615. 109531f (cmd/openshift-install/create: Retry watch connections, 2018-11-02, openshift#606) made the watch re-connects reliable, so make watch timeouts fatal again. This avoids confusing users by showing "Install complete!" messages when they may actually have a hung bootstrap process.
It seems like we lose our watch when bootkube goes down. From this job:
And simultaneously:
Ideally, the resourceVersion watch would allow the re-created watcher to pick up where its predecessor left off. But in at least some cases, that doesn't seem to be happening:

That's unfortunately missing timestamps for the setup logs, but in the successful logs from ci-op-9g1vhqtz, you can see the controller-manager becoming a leader around 20 seconds before bootstrap-complete. That means the bootstrap-complete event probably fired around when the watcher dropped, which was probably well before the setup container timed out (~38 minutes after it was launched). The setup container timing out was probably the watch re-connect event hitting the 30-minute eventContext timeout.

I think what's happening is something like:
With this commit, I've added a short sleep to the watch re-connect. Hopefully this is enough to more consistently put us into step 7; as it stands in master now we seem to be about evenly split between 5 and 7.
A more robust approach would be to put a short connection timeout on the watch re-connect, so that even when we did end up in case 5, we'd give up before too long and retry, with the second re-connect attempt ending up in case 7. But I'm currently not clear on how to code that up.
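One hedged sketch of how that could look, assuming the re-connect can be wrapped in a per-attempt context; the connect helper, lastRV parameter, and 15-second budget are hypothetical and are not this PR's code:

```go
package main

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// reconnectWithAttemptTimeout gives each individual re-connect attempt a
// short budget, so an attempt that lands on a dead backend (case 5) is
// abandoned quickly and the next attempt gets a chance to land on a live
// master (case 7).
func reconnectWithAttemptTimeout(ctx context.Context, lastRV string,
	connect func(ctx context.Context, resourceVersion string) (watch.Interface, error)) (watch.Interface, error) {
	for {
		attemptCtx, cancel := context.WithTimeout(ctx, 15*time.Second) // per-attempt budget (illustrative)
		w, err := connect(attemptCtx, lastRV)
		cancel()
		if err == nil {
			return w, nil
		}
		if ctx.Err() != nil {
			return nil, ctx.Err() // overall deadline expired; stop retrying
		}
	}
}
```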
CC @abhinavdahiya, @crawford