cmd/openshift-install/create: Retry watch connections #606

wking · 2018-11-03T05:28:41Z

It seems like we lose our watch when bootkube goes down. From this job:

$ oc project ci-op-9g1vhqtz
$ oc logs -f --timestamps e2e-aws -c setup | tee /tmp/setup.log
...
2018-11-03T04:19:55.121757935Z level=debug msg="added openshift-master-controllers.1563825412a4d77b: controller-manager-b5v49 became leader"
...
2018-11-03T04:20:14.679215171Z level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3069"
2018-11-03T04:20:16.539967372Z level=debug msg="added bootstrap-complete: cluster bootstrapping has completed"
2018-11-03T04:20:16.540030121Z level=info msg="Destroying the bootstrap resources..."
...

And simultaneously:

$ ssh -i libra.pem [email protected] journalctl -f | tee /tmp/bootstrap.log
...
Nov 03 04:20:14 ip-10-0-10-86 bootkube.sh[1033]: All self-hosted control plane components successfully started
Nov 03 04:20:14 ip-10-0-10-86 bootkube.sh[1033]: Tearing down temporary bootstrap control plane...
Nov 03 04:20:15 ip-10-0-10-86 hyperkube[840]: E1103 04:20:15.968877     840 kuberuntime_container.go:65] Can't make a ref to pod "bootstrap-cluster-version-operator-ip-10-0-10-86_openshift-cluster-version(99ccfef8309f84bf88a0ca4a277097ac)", container cluster-version-operator: selfLink was empty, can't make reference
Nov 03 04:20:15 ip-10-0-10-86 hyperkube[840]: E1103 04:20:15.975624     840 kuberuntime_container.go:65] Can't make a ref to pod "bootstrap-kube-apiserver-ip-10-0-10-86_kube-system(427a4a342e137b5a9bb39a0feff24625)", container kube-apiserver: selfLink was empty, can't make reference
Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: W1103 04:20:16.002990     840 pod_container_deletor.go:75] Container "0c647bcd6317ac2e06b625c44151aa6a3487aa1c47c5f1468213756f9a48ef91" not found in pod's containers
Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: W1103 04:20:16.005146     840 pod_container_deletor.go:75] Container "bc47233151f1c0afaaee9e7abcfec9a515fe0a720ed2251fd7a51602c59060c5" not found in pod's containers
Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: progress.service holdoff time over, scheduling restart.
Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: Starting Report the completion of the cluster bootstrap process...
Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: Started Report the completion of the cluster bootstrap process.
Nov 03 04:20:16 ip-10-0-10-86 report-progress.sh[6828]: Reporting install progress...
Nov 03 04:20:16 ip-10-0-10-86 report-progress.sh[6828]: event/bootstrap-complete created
Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: I1103 04:20:16.719526     840 reconciler.go:181] operationExecutor.UnmountVolume started for volume "secrets" (UniqueName: "kubernetes.io/host-path/427a4a342e137b5a9bb39a0feff24625-secrets") pod "427a4a342e137b5a9bb39a0feff24625" (UID: "427a4a342e137b5a9bb39a0feff24625")
...

Ideally, the resourceVersion watch would allow the re-created watcher to pick up where its predecessor left off. But in at least some cases, that doesn't seem to be happening:

2018/11/02 23:30:00 Running pod e2e-aws
2018/11/02 23:48:00 Container test in pod e2e-aws completed successfully
2018/11/02 23:51:52 Container teardown in pod e2e-aws completed successfully
2018/11/03 00:08:33 Copying artifacts from e2e-aws into /logs/artifacts/e2e-aws
level=debug msg="Fetching \"Terraform Variables\"..."
...
level=debug msg="added openshift-master-controllers.1563734e367132e0: controller-manager-xlw62 became leader"
level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3288"
level=fatal msg="Error executing openshift-install: waiting for bootstrap-complete: timed out waiting for the condition"
2018/11/03 00:08:33 Container setup in pod e2e-aws failed, exit code 1, reason Error

That's unfortunately missing timestamps for the setup logs, but in the successful logs from ci-op-9g1vhqtz, you can see the controller-manager becoming a leader around 20 seconds before bootstrap-complete. That means the bootstrap-complete event probably fired around when the watcher dropped, which was probably well before the setup container timed out (~38 minutes after it was launched). The setup container timing out was probably the watch re-connect event hitting the 30 minute eventContext timeout.

I think what's happening is something like:

The pods bootkube is waiting for come up.
Bootkube tears itself down.
Our initial watch breaks.
The API becomes unstable.
Watch reconnects here hang forever.
The API stabilizes around the production control plane.
Watch reconnects here successfully reconnect and pick up where the broken watch left off.

With this commit, I've added a short sleep to the watch re-connect. Hopefully this is enough to more consistently put us into step 7; as it stands in master now we seem to be about evenly split between 5 and 7.

A more robust approach would be to put a short connection timeout on the watch re-connect, so that even when we did end up in case 5, we'd give up before too long and re-try, with the second re-connect attemp ending up in case 7. But I'm currently not clear on how to code that up.

CC @abhinavdahiya, @crawford

abhinavdahiya · 2018-11-03T05:48:43Z

We might be seeing this when bootkube takes the server out of service on bootstrap node...

kubernetes/client-go#374 (comment)

wking · 2018-11-03T06:02:02Z

We might be seeing this when bootkube takes the server out of service on bootstrap node...

Good link, and that suggests low NLB timeouts to address this issue. Can we do that easily? @crawford?

wking · 2018-11-03T06:50:31Z

Assorted references:

Our current hour-long idle timeout.
Docs for health checks, talking about a default of three consecutive 10-second timeouts (possibly with 30-second delays between attempts), before a target is scheduled for draining.
Docs for deregistration delays, talking about draining targets before marking them unused.
Terraform docs for the various target knobs.

wking · 2018-11-03T07:02:44Z

/test e2e-aws

Exercising openshift/release#2062

abhinavdahiya · 2018-11-03T14:27:50Z

@wking we should also tune our watch client to do very small dial timeouts.

crawford · 2018-11-03T14:32:38Z

The NLBs are configured here. I believe the reason for the long timeout is so that clients don't get disconnected while waiting for events.

wking · 2018-11-03T14:38:21Z

. I believe the reason for the long timeout is so that clients don't get disconnected while waiting for events.

Yeah, the long idle timeout should be fine. But we may want smaller target timeouts.

wking · 2018-11-03T14:44:24Z

we should also tune our watch client to do very small dial timeouts.

Sounds like the load-balancer keepalive means dial timeouts don't help.

As a temporary hack, how about a 5-minute timeout on the reconnect attempt seeing bootstrap-complete, and we warn and exit zero, but do not run destroy bootstrap if we hit that 5-minute timeout? That isn't the greatest UX, but it's still slightly better than out pre-auto-teardown UX ;).

wking · 2018-11-05T04:04:32Z

I don't see it discussed in kubernetes/client-go#374 or kubernetes/kubernetes#65012, but we can probably protect against being load-balanced to a dead master (like the bootstrap node after bootkube exits) by setting TLSHandshakeTimeout. That doesn't protect from "master died mid-connection without telling us", but we can worry about that later. Eventually one connection attempt will be routed to a live master.

wking · 2018-11-05T04:09:50Z

... by setting TLSHandshakeTimeout.

Or maybe not? Looks like the default is 10 seconds. Maybe we're recycling an existing HTTP/2 connection.

wking · 2018-11-05T06:52:34Z

... by setting TLSHandshakeTimeout.

Or maybe not? Looks like the default is 10 seconds. Maybe we're recycling an existing HTTP/2 connection.

Moving in this direction, I've pushed d2a110f -> c32c6e6, replacing my initial 5-second delay (which was not working) with a call to CloseIdleConnections. There's some... interesting... casting going on to access that method, discussion in the c32c6e6 commit message. But the idea is that this will (hopefully) close any cached keepalive or HTTP/2 connections and force a fresh connection for each reconnect, and that that will bring TLSHandshakeTimeout into play to protect us from "connecting" to the now-dead bootstrap control plane.

wking · 2018-11-05T07:28:31Z

Success. Try again to check stability:

/test e2e-aws

wking · 2018-11-05T07:59:31Z

Success again. One more try:

/test e2e-aws

wking · 2018-11-05T08:14:38Z

e2e-aws:

* module.vpc.aws_route.to_nat_gw[2]: 1 error(s) occurred:

* aws_route.to_nat_gw.2: Error finding route after creating it: Unable to find matching route for Route Table (rtb-03a92999b00075b2f) and destination CIDR block (0.0.0.0/0).

Maybe too many tests in the same namespace? Once more to see:

/test e2e-aws

wking · 2018-11-05T14:10:22Z

e2e-aws:

level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3217"
level=fatal msg="Error executing openshift-install: waiting for bootstrap-complete: timed out waiting for the condition"

So back to the drawing board :/

wking · 2018-11-05T18:40:20Z

I've pushed c32c6e6 -> e9907b4, adding a belt-and-suspenders transport.DisableKeepAlives = true. I'll see if I can get a tcpdump or some such to see how that plays in CI.

abhinavdahiya · 2018-11-05T18:53:21Z

@wking https://godoc.org/k8s.io/client-go/rest#Config has WrapTransport field that we might be able to override for custom keep-alive disable. Not sure if you already looked into that..

wking · 2018-11-05T19:12:26Z

/retest

I'm trying to get a TCP dump to get a better handle on whether we're reconnecting or not.

wking · 2018-11-05T19:12:44Z

Actually, I didn't mean /retest, I meant:

/test e2e-aws

wking · 2018-11-05T19:22:35Z

/retest

wking · 2018-11-05T19:43:11Z

images:

2018/11/05 19:41:04 Copying artifacts from release-latest into /logs/artifacts/release-latest
info: Using registry public hostname registry.svc.ci.openshift.org
info: Found 81 images in image stream
info: Manifests will be extracted to /tmp/release-image-0.0.1-2018-11-05T194045Z990286717
warning: Could not load current user information: user: unknown userid 1114450000
Unable to connect to the server: net/http: TLS handshake timeout
2018/11/05 19:41:05 Container release in pod release-latest failed, exit code 1, reason Error

/retest

wking · 2018-11-05T19:48:18Z

images:

2018/11/05 19:45:16 Copying artifacts from release-latest into /logs/artifacts/release-latest
info: Using registry public hostname registry.svc.ci.openshift.org
error: unable to check your credentials - pass --skip-check to bypass this error: Get https://registry.svc.ci.openshift.org/v2/: net/http: TLS handshake timeout
2018/11/05 19:45:16 Container release in pod release-latest failed, exit code 1, reason Error

/retest

We usually lose our watch when bootkube goes down. From job [1]: $ oc project ci-op-9g1vhqtz $ oc logs -f --timestamps e2e-aws -c setup | tee /tmp/setup.log ... 2018-11-03T04:19:55.121757935Z level=debug msg="added openshift-master-controllers.1563825412a4d77b: controller-manager-b5v49 became leader" ... 2018-11-03T04:20:14.679215171Z level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3069" 2018-11-03T04:20:16.539967372Z level=debug msg="added bootstrap-complete: cluster bootstrapping has completed" 2018-11-03T04:20:16.540030121Z level=info msg="Destroying the bootstrap resources..." ... And simultaneously: $ ssh -i libra.pem [email protected] journalctl -f | tee /tmp/bootstrap.log ... Nov 03 04:20:14 ip-10-0-10-86 bootkube.sh[1033]: All self-hosted control plane components successfully started Nov 03 04:20:14 ip-10-0-10-86 bootkube.sh[1033]: Tearing down temporary bootstrap control plane... Nov 03 04:20:15 ip-10-0-10-86 hyperkube[840]: E1103 04:20:15.968877 840 kuberuntime_container.go:65] Can't make a ref to pod "bootstrap-cluster-version-operator-ip-10-0-10-86_openshift-cluster-version(99ccfef8309f84bf88a0ca4a277097ac)", container cluster-version-operator: selfLink was empty, can't make reference Nov 03 04:20:15 ip-10-0-10-86 hyperkube[840]: E1103 04:20:15.975624 840 kuberuntime_container.go:65] Can't make a ref to pod "bootstrap-kube-apiserver-ip-10-0-10-86_kube-system(427a4a342e137b5a9bb39a0feff24625)", container kube-apiserver: selfLink was empty, can't make reference Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: W1103 04:20:16.002990 840 pod_container_deletor.go:75] Container "0c647bcd6317ac2e06b625c44151aa6a3487aa1c47c5f1468213756f9a48ef91" not found in pod's containers Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: W1103 04:20:16.005146 840 pod_container_deletor.go:75] Container "bc47233151f1c0afaaee9e7abcfec9a515fe0a720ed2251fd7a51602c59060c5" not found in pod's containers Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: progress.service holdoff time over, scheduling restart. Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: Starting Report the completion of the cluster bootstrap process... Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: Started Report the completion of the cluster bootstrap process. Nov 03 04:20:16 ip-10-0-10-86 report-progress.sh[6828]: Reporting install progress... Nov 03 04:20:16 ip-10-0-10-86 report-progress.sh[6828]: event/bootstrap-complete created Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: I1103 04:20:16.719526 840 reconciler.go:181] operationExecutor.UnmountVolume started for volume "secrets" (UniqueName: "kubernetes.io/host-path/427a4a342e137b5a9bb39a0feff24625-secrets") pod "427a4a342e137b5a9bb39a0feff24625" (UID: "427a4a342e137b5a9bb39a0feff24625") ... The resourceVersion watch will usually allow the re-created watcher to pick up where its predecessor left off [2]. But in at least some cases, that doesn't seem to be happening [3]: 2018/11/02 23:30:00 Running pod e2e-aws 2018/11/02 23:48:00 Container test in pod e2e-aws completed successfully 2018/11/02 23:51:52 Container teardown in pod e2e-aws completed successfully 2018/11/03 00:08:33 Copying artifacts from e2e-aws into /logs/artifacts/e2e-aws level=debug msg="Fetching \"Terraform Variables\"..." ... level=debug msg="added openshift-master-controllers.1563734e367132e0: controller-manager-xlw62 became leader" level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3288" level=fatal msg="Error executing openshift-install: waiting for bootstrap-complete: timed out waiting for the condition" 2018/11/03 00:08:33 Container setup in pod e2e-aws failed, exit code 1, reason Error That's unfortunately missing timestamps for the setup logs, but in the successful logs from ci-op-9g1vhqtz, you can see the controller-manager becoming a leader around 20 seconds before bootstrap-complete. That means the bootstrap-complete event probably fired around when the watcher dropped, which was probably well before the setup container timed out (~38 minutes after it was launched). The setup container timing out was probably the watch re-connect event hitting the 30 minute eventContext timeout. The issue seems to be that when watcherFunc returns an error, doReceive returned true, and receive exited, closing the doneChan (but *not* the resultChan). That left UntilWithoutRetry hung waiting for a result that never appeared (until the context timed it out). With this commit, I'm retrying the watch conncetion every two seconds until we get a successful reconnect (or the context times out). And I'm dropping the unused doneChan and closing resultChan instead to fix the hang mentioned in the previous paragraph. [1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/595/pull-ci-openshift-installer-master-e2e-aws/1173 [2]: https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#concurrency-control-and-consistency [3]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/45/pull-ci-openshift-cluster-version-operator-master-e2e-aws/58/build-log.txt

wking · 2018-11-10T01:08:47Z

I've pushed 406da21 -> 109531f, which I think fixes the issue here (I'd misunderstood Until's watchHandler and there was a resultChan-not-getting-closed bug in RetryWatcher as discussed in the new commit message). Can folks give this a spin?

wking · 2018-11-10T01:47:30Z

Success in under 30 minutes:

2018/11/10 01:10:31 Running pod e2e-aws
2018/11/10 01:28:08 Container setup in pod e2e-aws completed successfully
2018/11/10 01:30:45 Container test in pod e2e-aws completed successfully
2018/11/10 01:37:35 Container teardown in pod e2e-aws completed successfully
2018/11/10 01:37:36 Pod e2e-aws succeeded after 27m5s

Reproducible?

/test e2e-aws

wking · 2018-11-10T02:14:21Z

Successful reconnect from the setup container logs:

level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 2987"
level=debug msg="added bootstrap-complete: cluster bootstrapping has completed"
level=info msg="Destroying the bootstrap resources..."

And from the build log:

 2018/11/10 01:49:20 Running pod e2e-aws
2018/11/10 02:03:54 Container setup in pod e2e-aws completed successfully
2018/11/10 02:07:01 Container test in pod e2e-aws completed successfully
2018/11/10 02:12:45 Container teardown in pod e2e-aws completed successfully 2018/11/10 02:12:45 Pod e2e-aws succeeded after 23m25s

Once more:

/test e2e-aws

wking · 2018-11-10T02:42:38Z

Another successful reconnect, this one handling some re-watch connection issues gracefully:

level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 2889"
level=warning msg="Failed to connect events watcher: Get https://ci-op-zbc3k0fy-1d3f3-api.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=2889&watch=true:  dial tcp 54.162.14.82:6443: connect: connection refused"
level=warning msg="Failed to connect events watcher: Get https://ci-op-zbc3k0fy-1d3f3-api.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=2889&watch=true:  dial tcp 54.162.14.82:6443: connect: connection refused"
level=debug msg="added bootstrap-complete: cluster bootstrapping has completed"
level=info msg="Destroying the bootstrap resources..."

Build log:

2018/11/10 02:16:25 Running pod e2e-aws
2018/11/10 02:30:46 Container setup in pod e2e-aws completed successfully
2018/11/10 02:33:32 Container test in pod e2e-aws completed successfully
2018/11/10 02:38:31 Container teardown in pod e2e-aws completed successfully
2018/11/10 02:38:31 Pod e2e-aws succeeded after 22m7s

I'm satisfied, although I'd like to give this some cook time after landing before we revert #615 ;).

abhinavdahiya · 2018-11-10T04:01:11Z

The retry loop to create watcher is :| but I think we can iterate on it.

/lgtm

:yay:

openshift-ci-robot · 2018-11-10T04:01:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

smarterclayton · 2018-11-11T18:53:04Z

I'm not positive this is related, but around the time this merged we got a string of CI failures in the release payload creation job

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0

that looks like we aren't seeing the cluster stabilize before the e2e suite launches (some pods crash looping):

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/1528?log#log

Those of course could be preexisting failures in the operators, but we need to dig into whether we were masking the crashing before this merged

wking · 2018-11-12T05:34:07Z

... that looks like we aren't seeing the cluster stabilize before the e2e suite launches...

This PR may cause the setup container to exit earlier (just after the bootstrap-complete event, instead of hanging for 30 minutes post-API-up). That could cause the test container to fire earlier than it had been, although you can see from the logs of the job you mention that the test container goes through the additional router wait:

2018/11/11 04:44:41 Container setup in pod release-e2e-aws completed successfully
Setup success
NAME                           STATUS     ROLES     AGE       VERSION
ip-10-0-130-133.ec2.internal   NotReady   worker    26s       v1.11.0+d4cacc0
ip-10-0-158-232.ec2.internal   NotReady   worker    22s       v1.11.0+d4cacc0
ip-10-0-16-236.ec2.internal    Ready      master    5m        v1.11.0+d4cacc0
ip-10-0-162-66.ec2.internal    NotReady   worker    19s       v1.11.0+d4cacc0
ip-10-0-3-203.ec2.internal     Ready      master    5m        v1.11.0+d4cacc0
ip-10-0-38-91.ec2.internal     Ready      master    5m        v1.11.0+d4cacc0
API at https://ci-op-nfbzxlfx-5a633-api.origin-ci-int-aws.dev.rhcloud.com:6443 has responded
Waiting for router to be created ...

However, that router detection ends in a somewhat concerning:

NAME             DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
router-default   3         3         0         3            0           node-role.kubernetes.io/worker=   5s
Found router in openshift-ingress
Waiting for daemon set "router-default" rollout to finish: 0 of 3 updated pods are available...
error: watch closed before Until timeout
error openshift-ingress/ds/router-default did not come up
Unable to connect to the server: unexpected EOF
error openshift-ingress/ds/router-default did not come up
daemon set "router-default" successfully rolled out

That "successfully rolled out" message looks like it comes straight from k8s.io/kubernetes/pkg/kubectl/rollout_status.go, and I haven't looked into how it gets into the test container logs. But if there's a bug here I suspect it has to do with either the router-waiting code or the tests needing more than the router to be functioning.

This reverts commit 6dc1bf6, openshift#615. 109531f (cmd/openshift-install/create: Retry watch connections, 2018-11-02, openshift#606) made the watch re-connects reliable, so make watch timeouts fatal again. This avoids confusing users by showing "Install complete!" messages when they may actually have a hung bootstrap process.

openshift-ci-robot requested review from rajatchopra and staebler November 3, 2018 05:28

openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 3, 2018

wking force-pushed the delay-watch-reconnect branch from 8e1bcbf to d2a110f Compare November 3, 2018 05:30

wking force-pushed the delay-watch-reconnect branch from d2a110f to 8819de4 Compare November 5, 2018 06:29

openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 5, 2018

wking changed the title ~~cmd/openshift-install/create: New watcher 5-second sleep~~ cmd/openshift-install/create: Close idle connections for watch restarts Nov 5, 2018

wking force-pushed the delay-watch-reconnect branch 2 times, most recently from 25e8209 to c32c6e6 Compare November 5, 2018 06:52

ironcladlou mentioned this pull request Nov 5, 2018

Wire cloud-provider to kube-controller-manager openshift/cluster-kube-controller-manager-operator#65

Merged

wking force-pushed the delay-watch-reconnect branch from c32c6e6 to e9907b4 Compare November 5, 2018 18:39

wking force-pushed the delay-watch-reconnect branch from e9907b4 to 406da21 Compare November 5, 2018 19:35

wking mentioned this pull request Nov 5, 2018

cmd/openshift-install/create: Allow hung watch #615

Merged

wking force-pushed the delay-watch-reconnect branch from 406da21 to 109531f Compare November 10, 2018 01:06

wking changed the title ~~cmd/openshift-install/create: Close idle connections for watch restarts~~ cmd/openshift-install/create: Retry watch connections Nov 10, 2018

openshift-ci-robot assigned abhinavdahiya Nov 10, 2018

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 10, 2018

openshift-merge-robot merged commit 09aa205 into openshift:master Nov 10, 2018

wking deleted the delay-watch-reconnect branch November 10, 2018 17:42

wking mentioned this pull request Nov 27, 2018

clarify logComplete msgs and return err if wait for bootstrap complete exceeds timeout #727

Closed

wking mentioned this pull request Nov 27, 2018

Revert "cmd/openshift-install/create: Allow hung watch" #741

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/openshift-install/create: Retry watch connections #606

cmd/openshift-install/create: Retry watch connections #606

wking commented Nov 3, 2018

abhinavdahiya commented Nov 3, 2018 •

edited

Loading

wking commented Nov 3, 2018

wking commented Nov 3, 2018 •

edited

Loading

wking commented Nov 3, 2018

abhinavdahiya commented Nov 3, 2018

crawford commented Nov 3, 2018 •

edited

Loading

wking commented Nov 3, 2018

wking commented Nov 3, 2018 •

edited

Loading

wking commented Nov 5, 2018

wking commented Nov 5, 2018 •

edited

Loading

wking commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 5, 2018

abhinavdahiya commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 10, 2018

wking commented Nov 10, 2018

wking commented Nov 10, 2018

wking commented Nov 10, 2018

abhinavdahiya commented Nov 10, 2018

openshift-ci-robot commented Nov 10, 2018

smarterclayton commented Nov 11, 2018 •

edited

Loading

wking commented Nov 12, 2018

cmd/openshift-install/create: Retry watch connections #606

cmd/openshift-install/create: Retry watch connections #606

Conversation

wking commented Nov 3, 2018

abhinavdahiya commented Nov 3, 2018 • edited Loading

wking commented Nov 3, 2018

wking commented Nov 3, 2018 • edited Loading

wking commented Nov 3, 2018

abhinavdahiya commented Nov 3, 2018

crawford commented Nov 3, 2018 • edited Loading

wking commented Nov 3, 2018

wking commented Nov 3, 2018 • edited Loading

wking commented Nov 5, 2018

wking commented Nov 5, 2018 • edited Loading

wking commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 5, 2018

abhinavdahiya commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 5, 2018

wking commented Nov 10, 2018

wking commented Nov 10, 2018

wking commented Nov 10, 2018

wking commented Nov 10, 2018

abhinavdahiya commented Nov 10, 2018

openshift-ci-robot commented Nov 10, 2018

smarterclayton commented Nov 11, 2018 • edited Loading

wking commented Nov 12, 2018

abhinavdahiya commented Nov 3, 2018 •

edited

Loading

wking commented Nov 3, 2018 •

edited

Loading

crawford commented Nov 3, 2018 •

edited

Loading

wking commented Nov 3, 2018 •

edited

Loading

wking commented Nov 5, 2018 •

edited

Loading

smarterclayton commented Nov 11, 2018 •

edited

Loading