
Memory leak / concurrency issues in short-running workers #2522

Open
sentry-io bot opened this issue Sep 6, 2024 · 8 comments

Labels: area/general (related to whole service, not a specific part/integration), complexity/single-task (regular task, should be done within days), deployment (related to our deployment), kind/bug (something isn't working), sentry


sentry-io bot commented Sep 6, 2024

Sentry Issue: PCKT-002-PACKIT-SERVICE-7SS

Connection to Redis lost: Retry (17/20) in 1.00 second.
mfocko self-assigned this on Sep 6, 2024

mfocko commented Sep 6, 2024

Initial investigation

Redict

Even though the issue appears to come from Redict, its deployment is stable. As of writing this comment there is one pod deployed that has been running since July 29th (resources: requests={cpu=10m, memory=128Mi}, limits={cpu=10m, memory=256Mi}; based on the metrics, usage fluctuates around 50Mi, the dark green line in the graph below).

Briefly checking the Redict deployment, I notice an increasing trend in connected clients: before opening this issue it was around 1500, and as I'm writing this comment it's 3659. It may be related to the points below; the deployment itself, however, is stable.

short-running workers

OTOH the same cannot be said about the short-running workers… I doubled the memory of the short-running workers on Monday (September 2nd) because of this issue. It doesn't seem to help, therefore I suspect a memory leak is present (the light green and orange/brown lines on the graph below; drops indicate a restart of the pod).

Stats (from the last 90 days):

  • affects production in 98 % of the cases
    Comment: related to the load, as opposed to the barely used stage

  • 42 % of these exceptions are caught in short-running-0 and 33 % in short-running-1

    • minor occurrences in the service pod, which look like spikes (only one in the last 7 days, which cannot be said about the short-running queue)

    Comment: could be related to the fact that the short-running queue is handled concurrently (16 “threads”)

  • 76 % of the occurrences are caught in process_message
    Comment: a crime of opportunity, since the short-running workers handle both webhooks and process_message

(the following paragraphs refer mostly to the latest occurrence, 2024-09-05 18:00–21:00 UTC)

Sentry events during the incriminating period¹ don't reveal anything: one GitLab API exception and a few failed RPM builds.

Logs during the incriminating period don't reveal anything either; there is actually a gap during the time when the memory usage spiked and caused a restart.

The open issues for gevent, however, raise some suspicion:

https://github.com/gevent/gevent/issues?q=is%3Aissue+is%3Aopen+leak

I suspect a memory leak caused either by the concurrency or by incorrectly terminated threads. Incorrectly killed threads don't explain the memory spike, though.

Additionally, the fact that a forced restart of the short-running worker alleviates the issue supports the theory that the issue is caused by the worker itself.
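
To check the incorrectly-terminated-threads suspicion from a shell inside a worker pod, a minimal sketch (not from our codebase) could count the greenlets still tracked by the garbage collector; the greenlet package is assumed to be importable, which it should be as a gevent dependency:

import gc

import greenlet

# Every greenlet object still tracked by the GC; finished ones should be
# collectable, so a steadily growing count would back the leak theory.
tracked = [o for o in gc.get_objects() if isinstance(o, greenlet.greenlet)]
alive = [g for g in tracked if not g.dead]
print(f"{len(tracked)} greenlets tracked by the GC, {len(alive)} not finished yet")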

Long-running workers are probably affected only momentarily, as Celery maintains only one connection to Redis.

Memory metrics (short-running workers and Redict)

(image: memory usage of short-running workers and Redict)

TODO

  • Check the list of clients connected to Redict; there are age and idle attributes available that could corroborate the suspicion of threads not being killed off successfully
  • Monitor the clients connected to Redict (ideally via Prometheus, if possible); this is not integral to running the service, only temporarily helpful (a minimal sketch follows after this list)
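
For the second point, a minimal sketch of a throwaway exporter, assuming redis-py and prometheus_client are available in the image and using a hypothetical valkey hostname:

import time

import redis
from prometheus_client import Gauge, start_http_server

# Hypothetical host/port; in the deployment this would be the Redict/Valkey service.
client = redis.Redis(host="valkey", port=6379)
connected = Gauge("valkey_connected_clients", "Clients connected to the KV store")

start_http_server(9100)  # expose /metrics for Prometheus to scrape
while True:
    connected.set(client.info("clients")["connected_clients"])
    time.sleep(30)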

Footnotes

  1. The spike in memory usage that eventually caused the restart of the pod

mfocko changed the title from “Redict out of memory and restarting” to “Memory leak / concurrency issues in short-running workers” on Sep 6, 2024
mfocko added the kind/bug, deployment, complexity/single-task, area/general, and sentry labels on Sep 6, 2024

mfocko commented Sep 23, 2024

Status update

On Friday I replaced Redict with Valkey, and the workers got redeployed.

I've been checking the count of connected clients here and there:

timestamp               | connected clients
------------------------|------------------
after redeploy (Friday) | ~300
Sunday @ 20:06 UTC      | 4650
Sunday @ 20:49 UTC      | 4700
Monday @ 07:17 UTC      | 6308
Monday @ 10:49 UTC      | 8091

Based on this observation, rescaling the workers dropped the number of connections, and the issue is present across the different deployments (Redis, Redict, Valkey).

Posting the list of connected clients before experimenting with the queues


mfocko commented Sep 23, 2024

2º update

To pinpoint the issue more precisely, I've rescaled the workers while watching the stats from Valkey.

Queue         | Before scaling down | After scaling down | After scaling up
--------------|---------------------|--------------------|-----------------
long-running  | 8195                | 8169               | 8191
short-running | 8207                | 88                 | 111

OpenShift Metrics:

  • long-running
    (image omitted)
  • short-running
    (image omitted)

The issue is definitely coming from the short-running workers… Based on the previous findings:

  • short-running pods run out of memory, which causes a restart
  • Redis/Redict/Valkey also runs out of free connection slots

I assume that running out of connection slots is a side effect of the memory leak that causes the restart. This could be caused by failed cleanup of the concurrent threads in the short-running workers (each leaked thread holds onto both its allocated memory and an open connection to Valkey).

I also suspected a bug in the Celery client that fails to properly clean up the session afterwards, but this doesn't align with the memory issue, i.e., there would be open connections, but the memory should have been cleaned up.

Next steps

  • Rule out the garbage collector as an issue, i.e., trigger garbage collection manually and watch what happens (see the sketch after this list)
  • If the GC is not causing the issue, i.e., manually triggering the GC doesn't do anything (neither memory usage nor Valkey connections drop), continue investigating further
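
For the GC check, a minimal sketch to run from a shell inside a short-running pod; psutil is assumed to be available (otherwise /proc/self/status can be read directly):

import gc

import psutil

proc = psutil.Process()
rss_before = proc.memory_info().rss
unreachable = gc.collect()  # force a full collection of all generations
rss_after = proc.memory_info().rss

# If RSS (and the Valkey connection count) doesn't drop here, the GC is
# likely not the culprit and the investigation continues.
print(f"collected {unreachable} unreachable objects, RSS {rss_before} -> {rss_after} B")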

Captured output from the valkey-cli
# Clients (before scaling down long-running)
connected_clients:8195
maxclients:10000
client_recent_max_input_buffer:32768
client_recent_max_output_buffer:0
blocked_clients:4
pubsub_clients:9
clients_in_timeout_table:4
total_blocking_keys:8
127.0.0.1:6379> info clients

# Clients (after scaling down long-running)
connected_clients:8169
maxclients:10000
client_recent_max_input_buffer:24576
client_recent_max_output_buffer:0
blocked_clients:2
pubsub_clients:5
clients_in_timeout_table:2
total_blocking_keys:4
127.0.0.1:6379> info clients

# Clients (after scaling up long-running)
connected_clients:8191
maxclients:10000
client_recent_max_input_buffer:65536
client_recent_max_output_buffer:0
blocked_clients:4
pubsub_clients:7
clients_in_timeout_table:4
total_blocking_keys:12

# Clients (before scaling down short-running)
connected_clients:8207
maxclients:10000
client_recent_max_input_buffer:49152
client_recent_max_output_buffer:0
blocked_clients:4
pubsub_clients:9
clients_in_timeout_table:4
total_blocking_keys:8

# Clients (after scaling down short-running)
connected_clients:88
maxclients:10000
client_recent_max_input_buffer:32768
client_recent_max_output_buffer:0
blocked_clients:0
pubsub_clients:10
clients_in_timeout_table:0
total_blocking_keys:0

# Clients (after scaling up short-running)
connected_clients:111
maxclients:10000
client_recent_max_input_buffer:32768
client_recent_max_output_buffer:0
blocked_clients:1
pubsub_clients:9
clients_in_timeout_table:1
total_blocking_keys:4


mfocko commented Sep 24, 2024

Today the short-running pod died before it could use up all the Valkey connections, so it didn't spam Sentry.


mfocko commented Sep 26, 2024

Testing on prod pt. 2

127.0.0.1:6379> CONFIG GET timeout
1) "timeout"
2) "0"
127.0.0.1:6379> CONFIG SET timeout 3600
OK
127.0.0.1:6379> INFO CLIENTS
# Clients
connected_clients:1430
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:192
client_recent_max_output_buffer:0
blocked_clients:2
tracking_clients:0
pubsub_clients:9
watching_clients:0
clients_in_timeout_table:2
total_watched_keys:0
total_blocking_keys:4
total_blocking_keys_on_nokey:0

Before adjusting the timeout there were between 4k and 6k clients, so even the 1-hour timeout seems reasonable.

Checking the client list posted in a comment above, most of the clients have age == idle, with the last command being UNSUBSCRIBE (and usually just 3 commands executed: SUB, “something”, and UNSUB).
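
A minimal sketch of that check with redis-py (the valkey host below is a placeholder), counting the clients whose age equals their idle time and whose last command was an UNSUBSCRIBE variant:

import redis

client = redis.Redis(host="valkey", port=6379)  # placeholder host

# CLIENT LIST fields come back as strings; age/idle are in seconds.
stale = [
    c
    for c in client.client_list()
    if c["age"] == c["idle"] and c["cmd"].lower().endswith("unsubscribe")
]
print(f"{len(stale)} clients have been idle since their last (UN)SUBSCRIBE")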

Going through the Redis docs I found:

Even if by default connections are not subject to timeout, there are two conditions when it makes sense to set a timeout:

  • Mission critical applications where a bug in the client software may saturate the Redis server with idle connections, causing service disruption.
  • As a debugging mechanism in order to be able to connect with the server if a bug in the client software saturates the server with idle connections, making it impossible to interact with the server.

Gotta love the first point…


TODO

  • If it works, propagate the timeout configuration into the deployment
  • Check whether it helps with the memory leaks, or just relieves pressure on the Redis/Redict/Valkey instance…


mfocko commented Sep 27, 2024

Looks OK so far. After a brief inspection of the Valkey client list, I'm lowering the timeout further to 1800 (30 min × 60 s), as there is still a considerable number of connections that have age == idle and are older than 30 minutes.

Before switching, with the timeout at 3600 (1 hour), we hovered around 600–800 connections; right now (the cleanup is iterative, i.e., not all clients that exceed the timeout get cleaned up immediately) we are at 297 connections.


As for the short-running workers, I don't really see a noticeable difference:

(image: memory usage of the short-running workers)

(red line indicates the setting of the timeout)

The last restart happened yesterday, even after setting the timeout on Valkey, and there appears to be an increasing trend in the used memory, so the timeout doesn't appear to help in any way with the issue of the short-running workers. However, since Valkey now cleans up the connections, the leak no longer causes a DoS by running out of Valkey connections…


mfocko commented Oct 1, 2024

Currently hovering around 550 clients in Valkey; I will open a PR to have the timeout configured in our Valkey/Redict/Redis deployment.

mfocko added a commit to mfocko/deployment that referenced this issue Oct 8, 2024
The leaks in short-running pods result in idle connections
to Redict/Valkey; all of these KV databases have a ‹timeout› option in
their config that allows for iterative cleanup of hanging connections.

This mitigates the issue to the point of still having free connection
slots in Redict/Valkey, i.e., the pods will still be killed, but handlers
will not end up in a retry loop trying to connect to Redict/Valkey.

Since the config is 1:1 between Redis, Redict, and Valkey, create
one ConfigMap, map the config into the databases and pass the path
to the config as an argument.

Tested with Redict and Valkey. »NOT« tested with Redis.

Related to packit/packit-service#2522

Signed-off-by: Matej Focko <[email protected]>
mfocko added a commit to packit/deployment that referenced this issue Oct 9, 2024
- [x] Deployed on both stage and prod… :PepeLaugh:
   > Friday evening deployments hit different…

Related to packit/packit-service#2522
@majamassarini

Fixing this should also solve #2427
