Flaky Test: relay::tests::agones_token_router #877
Comments
On a failure, we do see the following:
Checking the logs for
Which shows that while it did fail to connect to the relay service, it eventually connected, which means it should have kicked over to being ready 🤔 |
Also, just in case I need it again, here's my test rig. Insert your own dev image.
#!/bin/bash
set -eo pipefail
export RUST_BACKTRACE=1
export IMAGE_TAG=us-docker.pkg.dev/quilkin-mark-dev/release/quilkin:0.8.0-dev-2be3696
for i in {1..100}
do
  echo "🔥 Test Run: $i"
  cargo test --color=always --lib relay::tests::agones_token_router -- --nocapture --exact
done
|
Put some extra debugging in agent.rs so we could see which readiness health check is failing:
And it's ... Interestingly, though, this failure shows no xDS connection issue; it's something else. Side note: it might be useful for health/readiness endpoints to return a JSON payload explaining why they are unhealthy. |
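To make the side note concrete, here is a minimal sketch of a readiness response that names the failing check. Everything in it is an assumption for illustration: `CheckResult`, `readiness_response`, and the JSON shape are hypothetical and are not Quilkin's actual types or endpoint format. It uses only the standard library, so the JSON is built by hand rather than with a serialization crate.

```rust
/// Hypothetical result of one readiness check (not Quilkin's real type).
#[derive(Debug, Clone, PartialEq)]
pub struct CheckResult {
    pub name: &'static str,
    pub ready: bool,
    /// Human-readable explanation, set when the check is not ready.
    pub reason: Option<String>,
}

/// Escapes quotes, backslashes, and control characters for a JSON string.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds an (HTTP status, JSON body) pair from the individual checks.
/// The status is 200 only when every check is ready, otherwise 503.
pub fn readiness_response(checks: &[CheckResult]) -> (u16, String) {
    let ready = checks.iter().all(|c| c.ready);
    let items: Vec<String> = checks
        .iter()
        .map(|c| {
            let reason = match &c.reason {
                Some(r) => format!("\"{}\"", escape_json(r)),
                None => "null".to_string(),
            };
            format!(
                "{{\"name\":\"{}\",\"ready\":{},\"reason\":{}}}",
                escape_json(c.name),
                c.ready,
                reason
            )
        })
        .collect();
    let body = format!("{{\"ready\":{},\"checks\":[{}]}}", ready, items.join(","));
    (if ready { 200 } else { 503 }, body)
}
```

With a body like this, a failed test run could log the response and show directly which check was holding readiness back, instead of needing extra debug statements in agent.rs.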
More debug logging. Current theory is that something is blocking in this block, before the code gets to here: Lines 428 to 430 in 2be3696
Next step is to narrow down what blocks before we can move to ready. |
End of day, but can confirm, this line here blocks and never returns in some circumstances. Line 427 in 2be3696
Will next dig into why that is, but @XAMPPRocky if you have suggestions also happy to take them. |
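One generic way to confirm a "blocks and never returns" suspicion in a test is to put a deadline around the suspect call so the test fails fast with a clear message instead of hanging. The sketch below is an assumption-laden illustration, not the code at the referenced line: `run_with_timeout` is a hypothetical helper, and it uses a plain thread plus a channel from the standard library. In async code the same idea is usually expressed by wrapping the future in a timeout.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `f` on a helper thread and waits at most `limit` for its result.
///
/// Returns `None` if `f` has not finished within `limit` (or if it panicked).
/// On timeout the helper thread is not cancelled; it is left running detached,
/// which is acceptable for a diagnostic in a test but not for production code.
pub fn run_with_timeout<T, F>(limit: Duration, f: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // A send error only means the receiver already gave up waiting.
        let _ = tx.send(f());
    });
    rx.recv_timeout(limit).ok()
}
```

Wrapping the suspect call this way, and logging when `None` comes back, narrows a hang down to a single call site without having to wait for the whole test to time out.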
May be related: Line 262 in 8197bdd
Apparently is flaky as well. |
I don't, unfortunately. I can't replicate it in prod; I think it might just be a timing thing with the threads. |
No worries - digging in. At least I can reproduce it pretty reliably now. |
Just ran this unit test 300 times off |
Okay, this is tricky. Here's what a successful log looks like for an agent (with enhanced debugging from my branch): Full Log
Seeing if I can identify what should be happening when it's not. |
What I've seen in the logs is usually stuff attempting to connect before a service is ready and then getting stuck in that. |
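For the "connects before the service is ready and gets stuck" pattern, a common mitigation is a bounded retry loop with backoff, so that an early failed attempt is retried rather than waited on forever. This is a minimal sketch under that assumption, not a description of how Quilkin connects: `retry_with_backoff` and its parameters are hypothetical names, and the operation is passed in as a closure.

```rust
use std::thread;
use std::time::Duration;

/// Calls `op` until it succeeds or `max_attempts` is reached, sleeping between
/// attempts with a delay that doubles each time, capped at `max_delay`.
///
/// `op` receives the 1-based attempt number. At least one attempt is always
/// made, and the last error is returned if every attempt fails.
pub fn retry_with_backoff<T, E, F>(
    max_attempts: u32,
    initial_delay: Duration,
    max_delay: Duration,
    mut op: F,
) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the most recent error.
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => {
                thread::sleep(delay);
                delay = (delay * 2).min(max_delay);
                attempt += 1;
            }
        }
    }
}
```

The important property for a flaky test is that every attempt is bounded: each call to `op` should itself have a timeout, otherwise a single stuck attempt defeats the retry loop.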
Tracking bug for flakiness in the Agones integration test.
Test seems to consistently fail here, when it does fail:
quilkin/agones/src/relay.rs
Line 312 in 6e0adc9
Something about the relay agent not coming up as ready for some reason.