Akka shard region buffering messages #7411
-
Do you see any warnings or info logs about buffered messages prior to the CPU hitting 100%? If the system was operating normally and then suddenly stopped working, that could be the result of shard message buffering, but you'd see movement in the cluster first (e.g. scaling up or down). Does this issue occur when there's no movement first?
-
Hi Aaron, I was thinking of applying some back-pressure logic when I encounter this situation, because it seems to be a state that the application is not able to recover from. Do you have any suggestions, or a better idea for the strategy to apply in this situation?
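A minimal sketch of the kind of back-pressure logic meant here, written in plain Python only to illustrate the idea. It is not an Akka API: the class name `BoundedForwarder`, the `high_water`/`low_water` thresholds, and the `paused` flag are all hypothetical. The assumption is that the component pulling from the broker can be told to stop pulling while the buffer is full, instead of letting the buffer grow without bound.

```python
from collections import deque


class BoundedForwarder:
    """Hypothetical bounded buffer with high/low water marks (not an Akka API)."""

    def __init__(self, high_water: int, low_water: int):
        if not 0 <= low_water < high_water:
            raise ValueError("require 0 <= low_water < high_water")
        self.high_water = high_water
        self.low_water = low_water
        self.buffer = deque()
        # True means: ask the upstream consumer to stop pulling messages.
        self.paused = False

    def offer(self, msg) -> bool:
        """Try to buffer a message; return False when the producer must back off."""
        if len(self.buffer) >= self.high_water:
            self.paused = True
            return False
        self.buffer.append(msg)
        # Reaching the high-water mark pauses the producer before overflow.
        if len(self.buffer) >= self.high_water:
            self.paused = True
        return True

    def drain(self, max_items: int) -> list:
        """Deliver up to max_items downstream; resume at or below the low mark."""
        count = min(max_items, len(self.buffer))
        out = [self.buffer.popleft() for _ in range(count)]
        if self.paused and len(self.buffer) <= self.low_water:
            self.paused = False
        return out
```

The gap between the two thresholds is deliberate: resuming only once the buffer has drained to the low mark avoids rapid pause/resume flapping when the downstream side recovers slowly.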
-
Scenario:
Problem:
The application is stable for days, consuming and forwarding ~2500 msg/s. From time to time, without any change in message volume or size, I see CPU spikes and increased memory usage. The CPU goes to 100% (and stays there for about 1 h), and memory slowly increases until the application stops (on one node of the cluster, the one where the MqttActor is running); the cluster is then rebalanced.
When the CPU is at 100%, I see logs related to "heartbeat interval is growing too large" and "Scheduled sending of heartbeat was delayed".
When the CPU is at 100%, the rate of messages forwarded by the ForwardActor shows drops and spikes on the cluster nodes where the MqttActor is not running, while it stays stable on the node where the MqttActor is running.
My first hypothesis is that the memory increase is due to the shard region buffering messages, because the other nodes cannot be considered reachable while the heartbeat is unstable.
The message drops in ForwardActors that are not running on the same node as the MqttActor could be because their messages were buffered. The spikes could be because the node was briefly considered reachable again, so the buffered messages were delivered to the actor.
My second hypothesis is that the CPU at 100% is due to the shard region having to buffer messages and then forward them when the other nodes become available.
This scenario would then be triggered by an "initial" CPU spike (for which I do not know the reason), after which the application is not able to recover until it stops.
Are these hypotheses plausible, and does anyone have an idea why this could happen or how to prevent it?