Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak / concurrency issues in short-running workers #2522

Open
sentry-io bot opened this issue Sep 6, 2024 · 8 comments
Open

Memory leak / concurrency issues in short-running workers #2522

sentry-io bot opened this issue Sep 6, 2024 · 8 comments
Assignees
Labels
area/general Related to whole service, not a specific part/integration. complexity/single-task Regular task, should be done within days. deployment Related to our deployment. kind/bug Something isn't working. sentry

Comments

@sentry-io
Copy link

sentry-io bot commented Sep 6, 2024

Sentry Issue: PCKT-002-PACKIT-SERVICE-7SS

Connection to Redis lost: Retry (17/20) in 1.00 second.
@mfocko mfocko self-assigned this Sep 6, 2024
@mfocko mfocko moved this from new to in-progress in Packit Kanban Board Sep 6, 2024
@mfocko
Copy link
Member

mfocko commented Sep 6, 2024

Initial investigation

Redict

Even though the issue appears to come from the Redict, the deployment is stable. As of time of writing this comment there is one pod deployed that's running since July 29th (with regards to the resources: requests={cpu=10m, memory=128Mi}, limits={cpu=10m, memory=256Mi}, based on the metrics usage fluctuates around 50Mi (dark green line in the graph below)).

Briefly checking the Redict deployment I notice an increasing trend of the connected clients, before opening this issue it was around 1500, currently as I'm writing this comment it's 3659. It may be related to the points below, however, the deployment is stable.

short-running workers

OTOH the same cannot be said about the short-running workers… I have doubled the memory of short-running workers on Monday (September 2nd) because of this issue. Doesn't seem to help, therefore I'm suspecting memory leak being present (the light green and orange/brown lines on the graph below, drops indicate the restart of the pod).

Stats (from the last 90 days):

  • affects production in 98 % of the cases
    Comment: related to the load, as opposed to the barely used stage

  • 42 % of these exceptions is caught in the short-running-0 and 33 % in short-running-1

    • minor occurrences in the service pod, looks like spikes (in the last 7 days only one, cannot be said about the short-running queue)

    Comment: could be related to the fact that short-running queue is handled concurrently (16 “threads”)

  • 76 % of the occurrences is caught in the process_message
    Comment: crime of opportunity, short-running handles webhooks and process_message

(following paragraphs speak mostly about the latest occurrence 2024-09-05 18:00-21:00 UTC)

Sentry events during the incriminating period1 don't reveal anything, one GitLab API exception and few failed RPM builds.

Logs during the incriminating period don't reveal anything either, there is actually a gap during the time when the memory usage spiked and caused a restart.

Issues for gevent, however, raise some suspicions:

https://github.com/gevent/gevent/issues?q=is%3Aissue+is%3Aopen+leak

I'm being suspicious of memory leak caused by the concurrency or wrongly terminated threads. Incorrectly killed threads don't explain the memory spike though.

Additionally, the fact that the forced restart of the short-running worker alleviates the issue supports the theory with the issue being caused by the worker itself.

Long-running workers are probably affected momentarily as the Celery maintains only one connection to the Redis.

Memory metrics (short-running workers and Redict)

Memory usage of short-running workers and redict

TODO

  • Check the list of connected clients to the Redict, there are age and idle attributes available that could corroborate the suspicion of threads not being killed off successfully
  • Monitor the connected clients to the Redict (ideally Prometheus if possible), is not integral to running the service, only helpful temporarily

Footnotes

  1. spike in the memory usage eventually causing the restart of the pod

@mfocko mfocko changed the title Redict out of memory and restarting Memory leak / concurrency issues in short-running workers Sep 6, 2024
@mfocko mfocko added kind/bug Something isn't working. deployment Related to our deployment. complexity/single-task Regular task, should be done within days. area/general Related to whole service, not a specific part/integration. sentry labels Sep 6, 2024
@mfocko
Copy link
Member

mfocko commented Sep 23, 2024

Status update

On Friday I replaced Redict with Valkey, workers got redeployed.

I've been checking the count of connected clients here and there:

timestamp connected clients
after redeploy (Friday) ~300
Sunday @ 20:06 UTC 4650
Sunday @ 20:49 UTC 4700
Monday @ 07:17 UTC 6308
Monday @ 10:49 UTC 8091

Based on the observation, rescaling of the workers dropped the amount of connections, the issues is present across different deployments (e.g., Redis, Redict, Valkey).

Posting list of the connected clients before experimenting with queues

@mfocko
Copy link
Member

mfocko commented Sep 23, 2024

2º update

To pinpoint the issues more precisely, I've rescaled the workers while watching the stats from the Valkey.

Queue Before scaling down After scaling down After scaling up
long-running 8195 8169 8191
short-running 8207 88 111

OpenShift Metrics:

  • long-running
    image
  • short-running
    image

The issue is definitely coming from short-running workers… Based on the previous findings:

  • short-running pods run out of memory which causes restart
  • Redis/Redict/Valkey also runs out of free connection slots

I assume that running out of connection slots is a side effect related to the memory leak that causes restart. This could be caused by failed clean up of the concurrent threads in the short-running workers (holds onto both allocated memory, and open connection to Valkey).

I also suspected bug in the Celery client that fails to properly clean up the session afterwards, but this doesn't align with the memory issue, i.e., there would be open connections, but memory should've been cleaned up.

Next steps

  • Rule out garbage collector as an issue, i.e., trigger the garbage collection manually and watch what happens
  • if the GC is not causing the issue, i.e., manually triggering GC doesn't do anything (neither memory, nor Valkey connections drop), continue investigating further

Captured output from the valkey-cli
# Clients (before scaling down long-running)
connected_clients:8195
maxclients:10000
client_recent_max_input_buffer:32768
client_recent_max_output_buffer:0
blocked_clients:4
pubsub_clients:9
clients_in_timeout_table:4
total_blocking_keys:8
127.0.0.1:6379> info clients

# Clients (after scaling down long-running)
connected_clients:8169
maxclients:10000
client_recent_max_input_buffer:24576
client_recent_max_output_buffer:0
blocked_clients:2
pubsub_clients:5
clients_in_timeout_table:2
total_blocking_keys:4
127.0.0.1:6379> info clients

# Clients (after scaling up long-running)
connected_clients:8191
maxclients:10000
client_recent_max_input_buffer:65536
client_recent_max_output_buffer:0
blocked_clients:4
pubsub_clients:7
clients_in_timeout_table:4
total_blocking_keys:12

# Clients (before scaling down short-running)
connected_clients:8207
maxclients:10000
client_recent_max_input_buffer:49152
client_recent_max_output_buffer:0
blocked_clients:4
pubsub_clients:9
clients_in_timeout_table:4
total_blocking_keys:8

# Clients (after scaling down short-running)
connected_clients:88
maxclients:10000
client_recent_max_input_buffer:32768
client_recent_max_output_buffer:0
blocked_clients:0
pubsub_clients:10
clients_in_timeout_table:0
total_blocking_keys:0

# Clients (after scaling up short-running)
connected_clients:111
maxclients:10000
client_recent_max_input_buffer:32768
client_recent_max_output_buffer:0
blocked_clients:1
pubsub_clients:9
clients_in_timeout_table:1
total_blocking_keys:4

@mfocko
Copy link
Member

mfocko commented Sep 24, 2024

Today the short-running pod died before it could use up all the Valkey connections, so it didn’t spam the Sentry.

@mfocko
Copy link
Member

mfocko commented Sep 26, 2024

Testing on prod pt. 2

127.0.0.1:6379> CONFIG GET timeout
1) "timeout"
2) "0"
127.0.0.1:6379> CONFIG SET timeout 3600
OK
127.0.0.1:6379> INFO CLIENTS
# Clients
connected_clients:1430
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:192
client_recent_max_output_buffer:0
blocked_clients:2
tracking_clients:0
pubsub_clients:9
watching_clients:0
clients_in_timeout_table:2
total_watched_keys:0
total_blocking_keys:4
total_blocking_keys_on_nokey:0

Before adjusting the timeout there was between 4k-6k clients, so even the 1hr timeout seems reasonable.

Checking the client list posted in some comment above, most of the clients have age == idle with the last command being UNSUBSCRIBE (and usually just 3 commands executed, SUB, “something”, and UNSUB)

Going through the Redis docs I found:

Even if by default connections are not subject to timeout, there are two conditions when it makes sense to set a timeout:

  • Mission critical applications where a bug in the client software may saturate the Redis server with idle connections, causing service disruption.
  • As a debugging mechanism in order to be able to connect with the server if a bug in the client software saturates the server with idle connections, making it impossible to interact with the server.

Gotta love the first point…


TODO

  • If it works, propagate the timeout configuration into the deployment
  • Check whether it helps with the memory leaks, or just alleviates the Redis/Redict/Valkey instance…

@mfocko
Copy link
Member

mfocko commented Sep 27, 2024

Looks OK so far, after brief inspection of the Valkey client list, lowering the timeout further to 1800 (30m * 60s), as there is still a considerable amount of connections that have age == idle and are older than 30 minutes.

Before switching to 3600 (1 hour), we hung around 600-800 connections, right now (clean up is iterative = not all clients that pass the timeout get cleaned up immediately) we are at 297 connections.


As for the short-running workers, I don't really see a noticeable difference

image

(red line indicates the setting of the timeout)

Last restart happened yesterday, even after setting the timeout on the Valkey and there appears to be an increasing trend for the used memory, therefore it doesn't appear that it helps in any way with the issue of the short-running workers. However, since the Valkey cleans up the connections, it doesn't cause DoS from running out of the Valkey connections…

@mfocko
Copy link
Member

mfocko commented Oct 1, 2024

Currently hanging around 550 clients in Valkey, will open a PR to have the timeout configured in our Valkey/Redict/Redis deployment.

mfocko added a commit to mfocko/deployment that referenced this issue Oct 8, 2024
The leaks in short-running pods result in idle connections
to Redict/Valkey, all of these KV-databases have ‹timeout› option in
their config that allows for iterative cleanup of hanging connections.

This mitigates the issue to the point of still having free connections
slots to Redict/Valkey, i.e., the pods shall be killed, but handlers do
not end up in a retry-loop trying to connect to Redict/Valkey.

Since the config is 1:1 between all Redis, Redict, and Valkey, create
one ConfigMap, map the config into the databases and pass the path
to the config as an argument.

Tested with Redict and Valkey. »NOT« tested with Redis.

Related to packit/packit-service#2522

Signed-off-by: Matej Focko <[email protected]>
mfocko added a commit to mfocko/deployment that referenced this issue Oct 8, 2024
The leaks in short-running pods result in idle connections
to Redict/Valkey, all of these KV-databases have ‹timeout› option in
their config that allows for iterative cleanup of hanging connections.

This mitigates the issue to the point of still having free connections
slots to Redict/Valkey, i.e., the pods shall be killed, but handlers do
not end up in a retry-loop trying to connect to Redict/Valkey.

Since the config is 1:1 between all Redis, Redict, and Valkey, create
one ConfigMap, map the config into the databases and pass the path
to the config as an argument.

Tested with Redict and Valkey. »NOT« tested with Redis.

Related to packit/packit-service#2522

Signed-off-by: Matej Focko <[email protected]>
mfocko added a commit to packit/deployment that referenced this issue Oct 9, 2024
- [x] Deployed on both stage and prod… :PepeLaugh:
   > Friday evening deployments hit different…

Related to packit/packit-service#2522
@majamassarini
Copy link
Member

Fixing this should also solve #2427

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/general Related to whole service, not a specific part/integration. complexity/single-task Regular task, should be done within days. deployment Related to our deployment. kind/bug Something isn't working. sentry
Projects
Status: in-progress
Development

No branches or pull requests

2 participants