cmd/openshift-install/create: Retry watch connections #606

Merged
merged 1 commit into openshift:master from delay-watch-reconnect on Nov 10, 2018

Conversation

@wking (Member) commented Nov 3, 2018

It seems like we lose our watch when bootkube goes down. From this job:

$ oc project ci-op-9g1vhqtz
$ oc logs -f --timestamps e2e-aws -c setup | tee /tmp/setup.log
...
2018-11-03T04:19:55.121757935Z level=debug msg="added openshift-master-controllers.1563825412a4d77b: controller-manager-b5v49 became leader"
...
2018-11-03T04:20:14.679215171Z level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3069"
2018-11-03T04:20:16.539967372Z level=debug msg="added bootstrap-complete: cluster bootstrapping has completed"
2018-11-03T04:20:16.540030121Z level=info msg="Destroying the bootstrap resources..."
...

And simultaneously:

$ ssh -i libra.pem [email protected] journalctl -f | tee /tmp/bootstrap.log
...
Nov 03 04:20:14 ip-10-0-10-86 bootkube.sh[1033]: All self-hosted control plane components successfully started
Nov 03 04:20:14 ip-10-0-10-86 bootkube.sh[1033]: Tearing down temporary bootstrap control plane...
Nov 03 04:20:15 ip-10-0-10-86 hyperkube[840]: E1103 04:20:15.968877     840 kuberuntime_container.go:65] Can't make a ref to pod "bootstrap-cluster-version-operator-ip-10-0-10-86_openshift-cluster-version(99ccfef8309f84bf88a0ca4a277097ac)", container cluster-version-operator: selfLink was empty, can't make reference
Nov 03 04:20:15 ip-10-0-10-86 hyperkube[840]: E1103 04:20:15.975624     840 kuberuntime_container.go:65] Can't make a ref to pod "bootstrap-kube-apiserver-ip-10-0-10-86_kube-system(427a4a342e137b5a9bb39a0feff24625)", container kube-apiserver: selfLink was empty, can't make reference
Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: W1103 04:20:16.002990     840 pod_container_deletor.go:75] Container "0c647bcd6317ac2e06b625c44151aa6a3487aa1c47c5f1468213756f9a48ef91" not found in pod's containers
Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: W1103 04:20:16.005146     840 pod_container_deletor.go:75] Container "bc47233151f1c0afaaee9e7abcfec9a515fe0a720ed2251fd7a51602c59060c5" not found in pod's containers
Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: progress.service holdoff time over, scheduling restart.
Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: Starting Report the completion of the cluster bootstrap process...
Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: Started Report the completion of the cluster bootstrap process.
Nov 03 04:20:16 ip-10-0-10-86 report-progress.sh[6828]: Reporting install progress...
Nov 03 04:20:16 ip-10-0-10-86 report-progress.sh[6828]: event/bootstrap-complete created
Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: I1103 04:20:16.719526     840 reconciler.go:181] operationExecutor.UnmountVolume started for volume "secrets" (UniqueName: "kubernetes.io/host-path/427a4a342e137b5a9bb39a0feff24625-secrets") pod "427a4a342e137b5a9bb39a0feff24625" (UID: "427a4a342e137b5a9bb39a0feff24625")
...

Ideally, the resourceVersion watch would allow the re-created watcher to pick up where its predecessor left off. But in at least some cases, that doesn't seem to be happening:

2018/11/02 23:30:00 Running pod e2e-aws
2018/11/02 23:48:00 Container test in pod e2e-aws completed successfully
2018/11/02 23:51:52 Container teardown in pod e2e-aws completed successfully
2018/11/03 00:08:33 Copying artifacts from e2e-aws into /logs/artifacts/e2e-aws
level=debug msg="Fetching \"Terraform Variables\"..."
...
level=debug msg="added openshift-master-controllers.1563734e367132e0: controller-manager-xlw62 became leader"
level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3288"
level=fatal msg="Error executing openshift-install: waiting for bootstrap-complete: timed out waiting for the condition"
2018/11/03 00:08:33 Container setup in pod e2e-aws failed, exit code 1, reason Error

That's unfortunately missing timestamps for the setup logs, but in the successful logs from ci-op-9g1vhqtz, you can see the controller-manager becoming a leader around 20 seconds before bootstrap-complete. That means the bootstrap-complete event probably fired around when the watcher dropped, which was probably well before the setup container timed out (~38 minutes after it was launched). The setup container's timeout was probably the watch re-connect hitting the 30-minute eventContext timeout.

I think what's happening is something like:

  1. The pods bootkube is waiting for come up.
  2. Bootkube tears itself down.
  3. Our initial watch breaks.
  4. The API becomes unstable.
  5. Watch reconnects here hang forever.
  6. The API stabilizes around the production control plane.
  7. Watch reconnects here successfully reconnect and pick up where the broken watch left off.

With this commit, I've added a short sleep to the watch re-connect. Hopefully this is enough to more consistently put us into step 7; as it stands in master now, we seem to be about evenly split between steps 5 and 7.

A more robust approach would be to put a short connection timeout on the watch re-connect, so that even when we did end up in case 5, we'd give up before too long and re-try, with the second re-connect attempt ending up in case 7. But I'm currently not clear on how to code that up.
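
Roughly, the sleep-before-reconnect idea looks something like this (a sketch only; reconnectAfterDelay and newEventWatcher are made-up names, not the installer's actual code):

```go
package example

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// reconnectAfterDelay waits a few seconds before re-creating the watcher,
// in the hope that the reconnect lands after the production control plane
// has stabilized (step 7 above) rather than during the unstable window
// (step 5).
func reconnectAfterDelay(ctx context.Context, lastResourceVersion string, newEventWatcher func(resourceVersion string) (watch.Interface, error)) (watch.Interface, error) {
	select {
	case <-time.After(5 * time.Second): // the short sleep this commit adds
	case <-ctx.Done():
		return nil, ctx.Err()
	}
	return newEventWatcher(lastResourceVersion)
}
```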

CC @abhinavdahiya, @crawford

@openshift-ci-robot added the size/XS and approved labels on Nov 3, 2018
@wking force-pushed the delay-watch-reconnect branch from 8e1bcbf to d2a110f on November 3, 2018 at 05:30
@abhinavdahiya (Contributor) commented Nov 3, 2018

We might be seeing this when bootkube takes the server out of service on the bootstrap node...

kubernetes/client-go#374 (comment)

@wking (Member Author) commented Nov 3, 2018

We might be seeing this when bootkube takes the server out of service on the bootstrap node...

Good link, and that suggests lowering NLB timeouts to address this issue. Can we do that easily? @crawford?

@wking (Member Author) commented Nov 3, 2018

Assorted references:

  • Our current hour-long idle timeout.
  • Docs for health checks, talking about a default of three consecutive 10-second timeouts (possibly with 30-second delays between attempts), before a target is scheduled for draining.
  • Docs for deregistration delays, talking about draining targets before marking them unused.
  • Terraform docs for the various target knobs.

@wking (Member Author) commented Nov 3, 2018

/test e2e-aws

Exercising openshift/release#2062

@abhinavdahiya (Contributor) commented:

@wking we should also tune our watch client to do very small dial timeouts.

@crawford (Contributor) commented Nov 3, 2018

The NLBs are configured here. I believe the reason for the long timeout is so that clients don't get disconnected while waiting for events.

@wking (Member Author) commented Nov 3, 2018

I believe the reason for the long timeout is so that clients don't get disconnected while waiting for events.

Yeah, the long idle timeout should be fine. But we may want smaller target timeouts.

@wking (Member Author) commented Nov 3, 2018

we should also tune our watch client to do very small dial timeouts.

Sounds like the load-balancer keepalive means dial timeouts don't help.

As a temporary hack, how about a 5-minute timeout on the reconnect attempt waiting for bootstrap-complete, where we warn and exit zero, but do not run destroy bootstrap, if we hit that 5-minute timeout? That isn't the greatest UX, but it's still slightly better than our pre-auto-teardown UX ;).
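
Sketching that hack (waitForBootstrapComplete and destroyBootstrap are hypothetical hooks, not the installer's real functions):

```go
package example

import (
	"context"
	"time"

	"github.com/sirupsen/logrus"
)

// waitWithGraceTimeout gives the bootstrap-complete wait its own
// five-minute budget; if it expires, warn and return success without
// tearing down the bootstrap resources.
func waitWithGraceTimeout(ctx context.Context, waitForBootstrapComplete func(context.Context) error, destroyBootstrap func() error) error {
	waitCtx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()
	if err := waitForBootstrapComplete(waitCtx); err != nil {
		logrus.Warnf("did not observe bootstrap-complete: %v; skipping bootstrap teardown", err)
		return nil // exit zero, but leave the bootstrap resources for manual cleanup
	}
	return destroyBootstrap()
}
```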

@wking (Member Author) commented Nov 5, 2018

I don't see it discussed in kubernetes/client-go#374 or kubernetes/kubernetes#65012, but we can probably protect against being load-balanced to a dead master (like the bootstrap node after bootkube exits) by setting TLSHandshakeTimeout. That doesn't protect from "master died mid-connection without telling us", but we can worry about that later. Eventually one connection attempt will be routed to a live master.
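
The TLSHandshakeTimeout knob lives on the HTTP transport; a minimal sketch of the idea (assuming each reconnect actually performs a fresh handshake):

```go
package example

import (
	"net/http"
	"time"
)

// newBoundedHandshakeTransport caps the TLS handshake, so a connection
// attempt that gets load-balanced to a dead target fails quickly instead
// of hanging, letting a later attempt reach a live master.
func newBoundedHandshakeTransport() *http.Transport {
	return &http.Transport{
		Proxy:               http.ProxyFromEnvironment,
		TLSHandshakeTimeout: 10 * time.Second, // net/http's default, made explicit
	}
}
```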

@wking (Member Author) commented Nov 5, 2018

... by setting TLSHandshakeTimeout.

Or maybe not? Looks like the default is 10 seconds. Maybe we're recycling an existing HTTP/2 connection.

@wking force-pushed the delay-watch-reconnect branch from d2a110f to 8819de4 on November 5, 2018 at 06:29
@openshift-ci-robot added the size/S label and removed the size/XS label on Nov 5, 2018
@wking changed the title from "cmd/openshift-install/create: New watcher 5-second sleep" to "cmd/openshift-install/create: Close idle connections for watch restarts" on Nov 5, 2018
@wking force-pushed the delay-watch-reconnect branch 2 times, most recently from 25e8209 to c32c6e6, on November 5, 2018 at 06:52
@wking (Member Author) commented Nov 5, 2018

... by setting TLSHandshakeTimeout.

Or maybe not? Looks like the default is 10 seconds. Maybe we're recycling an existing HTTP/2 connection.

Moving in this direction, I've pushed d2a110f -> c32c6e6, replacing my initial 5-second delay (which was not working) with a call to CloseIdleConnections. There's some... interesting... casting going on to access that method; see the discussion in the c32c6e6 commit message. But the idea is that this will (hopefully) close any cached keepalive or HTTP/2 connections and force a fresh connection for each reconnect, and that will bring TLSHandshakeTimeout into play to protect us from "connecting" to the now-dead bootstrap control plane.
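
The general shape of the idiom (the exact cast c32c6e6 uses is in its commit message; this is just an illustration):

```go
package example

import "net/http"

// closeIdleConnections checks at runtime whether the client's RoundTripper
// also provides CloseIdleConnections (as *http.Transport does) and, if so,
// calls it before re-creating the watch to force a fresh connection.
func closeIdleConnections(rt http.RoundTripper) {
	if closer, ok := rt.(interface{ CloseIdleConnections() }); ok {
		closer.CloseIdleConnections()
	}
}
```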

@wking (Member Author) commented Nov 5, 2018

Success. Try again to check stability:

/test e2e-aws

@wking (Member Author) commented Nov 5, 2018

Success again. One more try:

/test e2e-aws

@wking (Member Author) commented Nov 5, 2018

e2e-aws:

* module.vpc.aws_route.to_nat_gw[2]: 1 error(s) occurred:

* aws_route.to_nat_gw.2: Error finding route after creating it: Unable to find matching route for Route Table (rtb-03a92999b00075b2f) and destination CIDR block (0.0.0.0/0).

Maybe too many tests in the same namespace? Once more to see:

/test e2e-aws

@wking (Member Author) commented Nov 5, 2018

e2e-aws:

level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3217"
level=fatal msg="Error executing openshift-install: waiting for bootstrap-complete: timed out waiting for the condition"

So back to the drawing board :/

@wking (Member Author) commented Nov 5, 2018

I've pushed c32c6e6 -> e9907b4, adding a belt-and-suspenders transport.DisableKeepAlives = true. I'll see if I can get a tcpdump or some such to see how that plays in CI.

@abhinavdahiya (Contributor) commented:

@wking https://godoc.org/k8s.io/client-go/rest#Config has a WrapTransport field that we might be able to override to disable keep-alives. Not sure if you already looked into that.
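
Something like this, perhaps (just a sketch; it assumes client-go hands WrapTransport an *http.Transport):

```go
package example

import (
	"net/http"

	"k8s.io/client-go/rest"
)

// disableKeepAlives keeps whatever transport client-go builds (CAs, client
// certs, and so on) but flips DisableKeepAlives, so each watch (re)connect
// dials fresh instead of reusing a cached connection to the now-dead
// bootstrap control plane.
func disableKeepAlives(cfg *rest.Config) {
	cfg.WrapTransport = func(rt http.RoundTripper) http.RoundTripper {
		if t, ok := rt.(*http.Transport); ok { // assumption: usually an *http.Transport
			t.DisableKeepAlives = true
		}
		return rt
	}
}
```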

@wking (Member Author) commented Nov 5, 2018

/retest

I'm trying to get a TCP dump to get a better handle on whether we're reconnecting or not.

@wking (Member Author) commented Nov 5, 2018

Actually, I didn't mean /retest, I meant:

/test e2e-aws

@wking (Member Author) commented Nov 5, 2018

/retest

@wking force-pushed the delay-watch-reconnect branch from e9907b4 to 406da21 on November 5, 2018 at 19:35
@wking (Member Author) commented Nov 5, 2018

images:

2018/11/05 19:41:04 Copying artifacts from release-latest into /logs/artifacts/release-latest
info: Using registry public hostname registry.svc.ci.openshift.org
info: Found 81 images in image stream
info: Manifests will be extracted to /tmp/release-image-0.0.1-2018-11-05T194045Z990286717
warning: Could not load current user information: user: unknown userid 1114450000
Unable to connect to the server: net/http: TLS handshake timeout
2018/11/05 19:41:05 Container release in pod release-latest failed, exit code 1, reason Error

/retest

@wking (Member Author) commented Nov 5, 2018

images:

2018/11/05 19:45:16 Copying artifacts from release-latest into /logs/artifacts/release-latest
info: Using registry public hostname registry.svc.ci.openshift.org
error: unable to check your credentials - pass --skip-check to bypass this error: Get https://registry.svc.ci.openshift.org/v2/: net/http: TLS handshake timeout
2018/11/05 19:45:16 Container release in pod release-latest failed, exit code 1, reason Error

/retest

We usually lose our watch when bootkube goes down.  From job [1]:

  $ oc project ci-op-9g1vhqtz
  $ oc logs -f --timestamps e2e-aws -c setup | tee /tmp/setup.log
  ...
  2018-11-03T04:19:55.121757935Z level=debug msg="added openshift-master-controllers.1563825412a4d77b: controller-manager-b5v49 became leader"
  ...
  2018-11-03T04:20:14.679215171Z level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3069"
  2018-11-03T04:20:16.539967372Z level=debug msg="added bootstrap-complete: cluster bootstrapping has completed"
  2018-11-03T04:20:16.540030121Z level=info msg="Destroying the bootstrap resources..."
  ...

And simultaneously:

  $ ssh -i libra.pem [email protected] journalctl -f | tee /tmp/bootstrap.log
  ...
  Nov 03 04:20:14 ip-10-0-10-86 bootkube.sh[1033]: All self-hosted control plane components successfully started
  Nov 03 04:20:14 ip-10-0-10-86 bootkube.sh[1033]: Tearing down temporary bootstrap control plane...
  Nov 03 04:20:15 ip-10-0-10-86 hyperkube[840]: E1103 04:20:15.968877     840 kuberuntime_container.go:65] Can't make a ref to pod "bootstrap-cluster-version-operator-ip-10-0-10-86_openshift-cluster-version(99ccfef8309f84bf88a0ca4a277097ac)", container cluster-version-operator: selfLink was empty, can't make reference
  Nov 03 04:20:15 ip-10-0-10-86 hyperkube[840]: E1103 04:20:15.975624     840 kuberuntime_container.go:65] Can't make a ref to pod "bootstrap-kube-apiserver-ip-10-0-10-86_kube-system(427a4a342e137b5a9bb39a0feff24625)", container kube-apiserver: selfLink was empty, can't make reference
  Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: W1103 04:20:16.002990     840 pod_container_deletor.go:75] Container "0c647bcd6317ac2e06b625c44151aa6a3487aa1c47c5f1468213756f9a48ef91" not found in pod's containers
  Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: W1103 04:20:16.005146     840 pod_container_deletor.go:75] Container "bc47233151f1c0afaaee9e7abcfec9a515fe0a720ed2251fd7a51602c59060c5" not found in pod's containers
  Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: progress.service holdoff time over, scheduling restart.
  Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: Starting Report the completion of the cluster bootstrap process...
  Nov 03 04:20:16 ip-10-0-10-86 systemd[1]: Started Report the completion of the cluster bootstrap process.
  Nov 03 04:20:16 ip-10-0-10-86 report-progress.sh[6828]: Reporting install progress...
  Nov 03 04:20:16 ip-10-0-10-86 report-progress.sh[6828]: event/bootstrap-complete created
  Nov 03 04:20:16 ip-10-0-10-86 hyperkube[840]: I1103 04:20:16.719526     840 reconciler.go:181] operationExecutor.UnmountVolume started for volume "secrets" (UniqueName: "kubernetes.io/host-path/427a4a342e137b5a9bb39a0feff24625-secrets") pod "427a4a342e137b5a9bb39a0feff24625" (UID: "427a4a342e137b5a9bb39a0feff24625")
  ...

The resourceVersion watch will usually allow the re-created watcher to
pick up where its predecessor left off [2].  But in at least some
cases, that doesn't seem to be happening [3]:

  2018/11/02 23:30:00 Running pod e2e-aws
  2018/11/02 23:48:00 Container test in pod e2e-aws completed successfully
  2018/11/02 23:51:52 Container teardown in pod e2e-aws completed successfully
  2018/11/03 00:08:33 Copying artifacts from e2e-aws into /logs/artifacts/e2e-aws
  level=debug msg="Fetching \"Terraform Variables\"..."
  ...
  level=debug msg="added openshift-master-controllers.1563734e367132e0: controller-manager-xlw62 became leader"
  level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 3288"
  level=fatal msg="Error executing openshift-install: waiting for bootstrap-complete: timed out waiting for the condition"
  2018/11/03 00:08:33 Container setup in pod e2e-aws failed, exit code 1, reason Error

That's unfortunately missing timestamps for the setup logs, but in the
successful logs from ci-op-9g1vhqtz, you can see the
controller-manager becoming a leader around 20 seconds before
bootstrap-complete.  That means the bootstrap-complete event probably
fired around when the watcher dropped, which was probably well before
the setup container timed out (~38 minutes after it was launched).
The setup container timing out was probably the watch re-connect event
hitting the 30 minute eventContext timeout.

The issue seems to be that when watcherFunc returns an error, doReceive
returns true and receive exits, closing the doneChan (but *not* the
resultChan).  That leaves UntilWithoutRetry hung waiting for a result
that never appears (until the context times it out).

With this commit, I'm retrying the watch connection every two seconds
until we get a successful reconnect (or the context times out).  And
I'm dropping the unused doneChan and closing resultChan instead, to
fix the hang mentioned in the previous paragraph.

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/595/pull-ci-openshift-installer-master-e2e-aws/1173
[2]: https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#concurrency-control-and-consistency
[3]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/45/pull-ci-openshift-cluster-version-operator-master-e2e-aws/58/build-log.txt
@wking force-pushed the delay-watch-reconnect branch from 406da21 to 109531f on November 10, 2018 at 01:06
@wking (Member Author) commented Nov 10, 2018

I've pushed 406da21 -> 109531f, which I think fixes the issue here (I'd misunderstood Until's watchHandler and there was a resultChan-not-getting-closed bug in RetryWatcher as discussed in the new commit message). Can folks give this a spin?
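
The retry loop, roughly (a sketch; newEventWatcher stands in for the installer's watcher constructor):

```go
package example

import (
	"context"
	"time"

	"github.com/sirupsen/logrus"
	"k8s.io/apimachinery/pkg/watch"
)

// retryWatch keeps trying to re-create the events watcher from the last
// seen resourceVersion, sleeping two seconds between failures, until a
// watcher comes up or the surrounding context expires.
func retryWatch(ctx context.Context, resourceVersion string, newEventWatcher func(string) (watch.Interface, error)) (watch.Interface, error) {
	for {
		w, err := newEventWatcher(resourceVersion)
		if err == nil {
			return w, nil
		}
		logrus.Warningf("Failed to connect events watcher: %v", err)
		select {
		case <-time.After(2 * time.Second):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
}
```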

@wking changed the title from "cmd/openshift-install/create: Close idle connections for watch restarts" to "cmd/openshift-install/create: Retry watch connections" on Nov 10, 2018
@wking (Member Author) commented Nov 10, 2018

Success in under 30 minutes:

2018/11/10 01:10:31 Running pod e2e-aws
2018/11/10 01:28:08 Container setup in pod e2e-aws completed successfully
2018/11/10 01:30:45 Container test in pod e2e-aws completed successfully
2018/11/10 01:37:35 Container teardown in pod e2e-aws completed successfully
2018/11/10 01:37:36 Pod e2e-aws succeeded after 27m5s

Reproducible?

/test e2e-aws

@wking (Member Author) commented Nov 10, 2018

Successful reconnect from the setup container logs:

level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 2987"
level=debug msg="added bootstrap-complete: cluster bootstrapping has completed"
level=info msg="Destroying the bootstrap resources..."

And from the build log:

2018/11/10 01:49:20 Running pod e2e-aws
2018/11/10 02:03:54 Container setup in pod e2e-aws completed successfully
2018/11/10 02:07:01 Container test in pod e2e-aws completed successfully
2018/11/10 02:12:45 Container teardown in pod e2e-aws completed successfully
2018/11/10 02:12:45 Pod e2e-aws succeeded after 23m25s

Once more:

/test e2e-aws

@wking (Member Author) commented Nov 10, 2018

Another successful reconnect, this one handling some re-watch connection issues gracefully:

level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 2889"
level=warning msg="Failed to connect events watcher: Get https://ci-op-zbc3k0fy-1d3f3-api.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=2889&watch=true:  dial tcp 54.162.14.82:6443: connect: connection refused"
level=warning msg="Failed to connect events watcher: Get https://ci-op-zbc3k0fy-1d3f3-api.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=2889&watch=true:  dial tcp 54.162.14.82:6443: connect: connection refused"
level=debug msg="added bootstrap-complete: cluster bootstrapping has completed"
level=info msg="Destroying the bootstrap resources..."

Build log:

2018/11/10 02:16:25 Running pod e2e-aws
2018/11/10 02:30:46 Container setup in pod e2e-aws completed successfully
2018/11/10 02:33:32 Container test in pod e2e-aws completed successfully
2018/11/10 02:38:31 Container teardown in pod e2e-aws completed successfully
2018/11/10 02:38:31 Pod e2e-aws succeeded after 22m7s 

I'm satisfied, although I'd like to give this some cook time after landing before we revert #615 ;).

@abhinavdahiya (Contributor) commented:

The retry loop to create the watcher is :| but I think we can iterate on it.

/lgtm

:yay:

@openshift-ci-robot added the lgtm label on Nov 10, 2018
@openshift-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit 09aa205 into openshift:master on Nov 10, 2018
@wking deleted the delay-watch-reconnect branch on November 10, 2018 at 17:42
@smarterclayton (Contributor) commented Nov 11, 2018

I'm not positive this is related, but around the time this merged we got a string of CI failures in the release payload creation job

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0

that looks like we aren't seeing the cluster stabilize before the e2e suite launches (some pods crash looping):

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/1528?log#log

Those of course could be preexisting failures in the operators, but we need to dig into whether we were masking the crashing before this merged.

@wking (Member Author) commented Nov 12, 2018

... that looks like we aren't seeing the cluster stabilize before the e2e suite launches...

This PR may cause the setup container to exit earlier (just after the bootstrap-complete event, instead of hanging for 30 minutes post-API-up). That could cause the test container to fire earlier than it had been, although you can see from the logs of the job you mention that the test container goes through the additional router wait:

2018/11/11 04:44:41 Container setup in pod release-e2e-aws completed successfully
Setup success
NAME                           STATUS     ROLES     AGE       VERSION
ip-10-0-130-133.ec2.internal   NotReady   worker    26s       v1.11.0+d4cacc0
ip-10-0-158-232.ec2.internal   NotReady   worker    22s       v1.11.0+d4cacc0
ip-10-0-16-236.ec2.internal    Ready      master    5m        v1.11.0+d4cacc0
ip-10-0-162-66.ec2.internal    NotReady   worker    19s       v1.11.0+d4cacc0
ip-10-0-3-203.ec2.internal     Ready      master    5m        v1.11.0+d4cacc0
ip-10-0-38-91.ec2.internal     Ready      master    5m        v1.11.0+d4cacc0
API at https://ci-op-nfbzxlfx-5a633-api.origin-ci-int-aws.dev.rhcloud.com:6443 has responded
Waiting for router to be created ...

However, that router detection ends with a somewhat concerning sequence:

NAME             DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
router-default   3         3         0         3            0           node-role.kubernetes.io/worker=   5s
Found router in openshift-ingress
Waiting for daemon set "router-default" rollout to finish: 0 of 3 updated pods are available...
error: watch closed before Until timeout
error openshift-ingress/ds/router-default did not come up
Unable to connect to the server: unexpected EOF
error openshift-ingress/ds/router-default did not come up
daemon set "router-default" successfully rolled out

That "successfully rolled out" message looks like it comes straight from k8s.io/kubernetes/pkg/kubectl/rollout_status.go, and I haven't looked into how it gets into the test container logs. But if there's a bug here I suspect it has to do with either the router-waiting code or the tests needing more than the router to be functioning.

wking added a commit to wking/openshift-installer that referenced this pull request Nov 27, 2018
This reverts commit 6dc1bf6, openshift#615.

109531f (cmd/openshift-install/create: Retry watch connections,
2018-11-02, openshift#606) made the watch re-connects reliable, so make watch
timeouts fatal again.  This avoids confusing users by showing "Install
complete!" messages when they may actually have a hung bootstrap
process.
flaper87 pushed a commit to flaper87/installer that referenced this pull request Nov 29, 2018
This reverts commit 6dc1bf6, openshift#615.

109531f (cmd/openshift-install/create: Retry watch connections,
2018-11-02, openshift#606) made the watch re-connects reliable, so make watch
timeouts fatal again.  This avoids confusing users by showing "Install
complete!" messages when they may actually have a hung bootstrap
process.