Integration tests are exhibiting flaky behavior #84

Open
jessicayuen opened this issue May 12, 2020 · 2 comments
Labels
bug Something isn't working

Comments

@jessicayuen
Member

jessicayuen commented May 12, 2020

Seems to be failing more frequently:

=== RUN   TestXdsClientGetsIncrementalResponsesFromUpstreamServer
2020/05/12 20:31:22 management server listening on 19001
    TestXdsClientGetsIncrementalResponsesFromUpstreamServer: upstream_client_test.go:47: 
        	Error Trace:	upstream_client_test.go:47
        	Error:      	Setup failed: %s
        	Test:       	TestXdsClientGetsIncrementalResponsesFromUpstreamServer
        	Messages:   	rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:19001: connect: connection refused"

And,

=== RUN   TestServerShutdownShouldCloseResponseChannel
2020/05/12 18:43:45 listen tcp :19001: bind: address already in use
FAIL	github.com/envoyproxy/xds-relay/integration	7.554s
FAIL
make: *** [integration-tests] Error 1
@jessicayuen jessicayuen added the bug Something isn't working label May 12, 2020
@eapolinario
Contributor

eapolinario commented May 14, 2020

As we mentioned during our sync meeting today, these are 2 separate errors:

  1. This is related to https://github.com/envoyproxy/xds-relay/blob/master/integration/upstream_client_test.go#L208-L218. As @LisaLudique pointed out, "connection refused" means that nothing is listening on that port, which can happen if the goroutine that starts the management server hasn't run by the time we try to connect to it. @jyotimahapatra created an issue to configure the gRPC options in that call to the management server, so we could experiment with configuring retries there. Fundamentally, though, the problem is that we have no way to signal that the management server is ready before we try to connect to it.

  2. This is caused by multiple tests trying to start the management server on the same port. We pass a context to the goroutine that starts the management server, and that context is used to gracefully shut the server down.
    We don't have much insight into how the Go runtime schedules the tests (besides the fact that tests in the same package run serially), but my guess is that, at least on Linux, we're not leaving enough time for the runtime and the OS to finish the usual workflow of closing the TCP connection. The e2e tests wait 1 second between tests; we might adopt a similar approach in the integration test suite (as a side note, 1s is too much).

@eapolinario
Contributor

jessicayuen pushed a commit that referenced this issue May 19, 2020
We retry the integration tests up to 3 times using https://github.com/marketplace/actions/retry-step.

This is a stopgap until we find a better solution to the issue described in #84 (comment).

I'm adding the failing test just to confirm the retries actually happen.

Signed-off-by: eapolinario <[email protected]>