Bug Report: connection pool timed out errors when there is a spike in borrowed/waiting connections due to race condition #17662

Open
mhamza15 opened this issue Jan 30, 2025 · 3 comments · May be fixed by #17675


mhamza15 commented Jan 30, 2025

Overview of the Issue

There seems to be a race condition in connection pooling that leaves a waiter deadlocked. It occurs when a large number of connections are borrowed or waited on at once, specifically when no new connection requests arrive afterwards. Here is the general flow, assuming a connection pool of size 1 as an example:

  1. "Thread" A borrows a connection from the pool.
  2. Thread B attempts to borrow a connection from the pool and finds it empty.
  3. Some time after Thread B checks the pool, but before it gets a chance to join the waitlist, Thread A completes and tries to pass its connection on to a waiter in the waitlist. As there are no waiters yet, it simply returns the connection to the pool.
  4. Thread B now joins the waitlist, but the connection is already back in the pool and no borrowed connection remains whose return could hand one over. Thread B blocks waiting for a connection that is never passed on, the context times out, and we see our error "code = ResourceExhausted desc = connection pool timed out".
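
The four steps above can be replayed deterministically with a toy model. This is only a sketch under stated assumptions: `TinyPool`, `try_get`, `join_waitlist`, `put`, and `replay_race` are hypothetical names invented for illustration and are not the Vitess pool's API. The borrow is deliberately split into two separate calls so that the gap between "check the pool" and "join the waitlist" is visible.

```python
from collections import deque


class TinyPool:
    """Toy pool (not Vitess code): idle connections plus a waitlist."""

    def __init__(self, size):
        self.idle = deque(f"conn-{i}" for i in range(size))
        self.waiters = deque()

    def try_get(self):
        # First half of a borrow: take an idle connection if one exists.
        return self.idle.popleft() if self.idle else None

    def join_waitlist(self, name):
        # Second half of a borrow: register as a waiter.
        # In this model only put() ever serves waiters.
        slot = {"name": name, "conn": None}
        self.waiters.append(slot)
        return slot

    def put(self, conn):
        # Return path: hand over to the oldest waiter, else park it as idle.
        if self.waiters:
            self.waiters.popleft()["conn"] = conn
        else:
            self.idle.append(conn)


def replay_race():
    pool = TinyPool(size=1)
    conn_a = pool.try_get()           # 1. A borrows the only connection
    seen_by_b = pool.try_get()        # 2. B checks the pool and finds it empty
    pool.put(conn_a)                  # 3. A returns before B is on the waitlist
    slot_b = pool.join_waitlist("B")  # 4. B joins the waitlist too late
    return seen_by_b, slot_b["conn"], list(pool.idle), len(pool.waiters)


print(replay_race())  # (None, None, ['conn-0'], 1)
```

The final state has one idle connection and one waiter at the same time, which is the stuck state described in step 4: nothing will serve B until some other connection is returned.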

Normally, in a live production system, a new query would come in and take the idle connection straight from the pool rather than waiting for a handoff. When that query completes, it would pass its connection on to Thread B, breaking the deadlock. In our (GitHub) CI, however, our queries trigger the race condition more often, because our test cleanup code fires a batch of queries all at once as part of a UNION ALL. These queries quickly exceed the connection pool size, execute quickly, and hit the race. Since we're at the end of our test(s), no new queries arrive to pull a connection directly from the pool, and the waiter is stuck until it times out.
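
To make the difference between production traffic and the end of a test run concrete, here is a minimal sketch that starts from the stuck state (one idle connection, Thread B on the waitlist) and compares the two outcomes. The function `simulate`, its parameter `later_query_arrives`, and the "Query C" in the comments are hypothetical and exist only for illustration. The model assumes, as described above, that returning a connection is the only step that inspects the waitlist.

```python
from collections import deque


def simulate(later_query_arrives):
    idle = deque(["conn-0"])  # state right after the race: connection is idle...
    waiters = deque(["B"])    # ...while B is stranded on the waitlist
    served = {}

    def put(conn):
        # Returning a connection is the only step that looks at the waitlist.
        if waiters:
            served[waiters.popleft()] = conn
        else:
            idle.append(conn)

    if later_query_arrives:
        conn = idle.popleft()  # Query C finds the idle connection immediately
        put(conn)              # C finishes; its return hands the connection to B
    return served.get("B"), list(idle)


print(simulate(True))   # ('conn-0', [])   production-like traffic rescues B
print(simulate(False))  # (None, ['conn-0'])   end of a test run: B stays stuck
```

With a later query, B receives the connection through the normal handoff on return. Without one, the connection sits idle while B waits for its context to time out.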

Reproduction Steps

@arthurschreiber has come up with a test case that pretty consistently reproduces the error: #17661

Binary Version

main

Operating System and Environment details

all

Log Fragments

Trilogy::ProtocolError: 1203: target: github_test_repositories_actions_checks12.-80.primary: vttablet: rpc error: code = ResourceExhausted desc = connection pool timed out (CallerID: userData1) (trilogy_query_recv)

@mhamza15 added the Needs Triage and Type: Bug labels on Jan 30, 2025
@arthurschreiber added the Component: VTTablet label and removed the Needs Triage label on Jan 30, 2025

deepthi (Member) commented Jan 31, 2025

Thank you for opening this. Do you or @arthurschreiber intend to contribute a fix?

arthurschreiber (Contributor) commented

@deepthi We haven't come up with a fix yet. Our plan is to see if we can put something together this week, otherwise I guess I'll reach out to @vmg?

vmg (Collaborator) commented Jan 31, 2025

Appreciate the reproduction test! Will test locally and try to think of a fix. Cheers!

@vmg linked a pull request on Jan 31, 2025 that will close this issue