Bug Report: `connection pool timed out` errors when there is a spike in borrowed/waiting connections due to race condition #17662

mhamza15 · 2025-01-30T15:31:07Z

Overview of the Issue

There seems to be a race condition in connection pooling that causes a deadlock when a large number of connections are borrowed or waited on at once, specifically when no new connection requests arrive afterwards. Here is the general flow, assuming a connection pool of size 1:

1. Thread A borrows a connection from the pool.
2. Thread B attempts to borrow a connection from the pool and finds it empty.
3. Some time after Thread B checks the pool, but before it gets a chance to join the waitlist, Thread A completes and tries to pass its connection on to a waiter in the waitlist. As there are no waiters yet, it simply returns the connection to the pool.
4. Thread B now joins the waitlist, but the only connection is already sitting idle in the pool, and no borrowed connection remains whose return could hand it over. Thread B blocks waiting for a connection that never arrives, the context times out, and we see the error `code = ResourceExhausted desc = connection pool timed out`.

Normally, in a live production system, a new query would come in and pull a connection straight from the pool, rather than waiting for an existing borrower to pass one on. When that query completes, it would hand its connection to Thread B, breaking the deadlock. But in our (GitHub) CI, the nature of our queries tends to trigger the race condition more often, as we fire a bunch of queries all at once as part of a UNION ALL in our test cleanup code. These queries exhaust the connection pool quickly, execute quickly, and cause the race condition. Since we're at the end of our test(s), no new queries are fired to pull a connection directly from the pool, and the waiter stays blocked until its context times out.
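To make the interleaving concrete, here is a minimal single-threaded sketch in Go that replays the four steps above and then the "rescue" by a later query. It is not the Vitess `smartconnpool` code: the `pool` type and the `tryGet`, `joinWaitlist`, `put`, and `simulate` names are all hypothetical, and the model assumes that a borrower does not re-check the pool after joining the waitlist.

```go
package main

import "fmt"

// pool is a deliberately simplified model of a connection pool with a
// waitlist. All names are hypothetical; this is not the Vitess implementation.
type pool struct {
	free    int      // idle connections sitting in the pool
	waiters []string // borrowers parked on the waitlist
	served  []string // borrowers that have been handed a connection
}

// tryGet takes an idle connection if one exists (the borrower's first check).
func (p *pool) tryGet(who string) bool {
	if p.free == 0 {
		return false
	}
	p.free--
	p.served = append(p.served, who)
	return true
}

// joinWaitlist parks a borrower. Assumption modeled here: the pool is not
// re-checked after joining, which is what opens the race window.
func (p *pool) joinWaitlist(who string) {
	p.waiters = append(p.waiters, who)
}

// put returns a connection: hand it to the oldest waiter if there is one,
// otherwise put it back into the idle set.
func (p *pool) put() {
	if len(p.waiters) > 0 {
		p.served = append(p.served, p.waiters[0])
		p.waiters = p.waiters[1:]
		return
	}
	p.free++
}

// simulate replays the interleaving from the report. It returns the idle
// count and waitlist length at the stall, and the waitlist length after a
// later borrower has come and gone.
func simulate() (stalledFree, stalledWaiters, finalWaiters int) {
	p := &pool{free: 1}

	p.tryGet("A")         // 1. A borrows the only connection
	gotB := p.tryGet("B") // 2. B checks the pool: empty
	p.put()               // 3. A returns; no waiters yet, so it goes to the pool
	if !gotB {
		p.joinWaitlist("B") // 4. B joins the waitlist too late
	}
	stalledFree, stalledWaiters = p.free, len(p.waiters)

	// In a live system a later query C borrows directly from the pool, and
	// its return hands the connection to B.
	p.tryGet("C")
	p.put()
	return stalledFree, stalledWaiters, len(p.waiters)
}

func main() {
	f, w, after := simulate()
	fmt.Printf("stalled: idle=%d waiters=%d\n", f, w)
	fmt.Printf("after a later borrow/return: waiters=%d\n", after)
}
```

At the stall the model has one idle connection and one waiter at the same time, which is the state described in step 4. Without the later borrower C (the CI case), nothing ever calls `put` again and B stays parked until its timeout.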

Reproduction Steps

@arthurschreiber has come up with a test case that pretty consistently reproduces the error: #17661

Binary Version

main

Operating System and Environment details

all

Log Fragments

Trilogy::ProtocolError: 1203: target: github_test_repositories_actions_checks12.-80.primary: vttablet: rpc error: code = ResourceExhausted desc = connection pool timed out (CallerID: userData1) (trilogy_query_recv)


deepthi · 2025-01-31T00:14:06Z

Thank you for opening this. Do you or @arthurschreiber intend to contribute a fix?

arthurschreiber · 2025-01-31T07:38:57Z

@deepthi We haven't come up with a fix yet. Our plan is to see if we can put something together this week, otherwise I guess I'll reach out to @vmg?

vmg · 2025-01-31T08:45:38Z

Appreciate the reproduction test! Will test locally and try to think of a fix. Cheers!

mhamza15 added the Needs Triage and Type: Bug labels Jan 30, 2025

arthurschreiber added the Component: VTTablet label and removed the Needs Triage label Jan 30, 2025

mhamza15 mentioned this issue Jan 30, 2025: add test case for waitlist race condition #17661 (Draft)

vmg linked a pull request Jan 31, 2025 that will close this issue: smartconnpool: do not allow connections to starve #17675 (Open)
