-
Maybe it's a similar issue here? We are using NATS; the consumer was not sending messages, while the stream has a sequence number of ~5m. After recreating the consumer we got messages again and the stats were plausible again.
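In case it helps anyone else checking for the same symptom, this is roughly how we compared the stream's position with what the consumer reported (a minimal sketch using the Java client; the stream and consumer names are made up):

```kotlin
import io.nats.client.Nats

fun main() {
    Nats.connect("nats://localhost:4222").use { nc ->
        val jsm = nc.jetStreamManagement()

        // Stream side: how far the stream itself has advanced (~5m in our case).
        val stream = jsm.getStreamInfo("WORK")
        println("stream last sequence: ${stream.streamState.lastSequence}")

        // Consumer side: what it still owes and what is in flight. When the
        // consumer was stuck, these numbers no longer looked plausible.
        val consumer = jsm.getConsumerInfo("WORK", "worker")
        println("pending (not yet delivered): ${consumer.numPending}")
        println("ack pending (delivered, unacked): ${consumer.numAckPending}")
    }
}
```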
-
Hi All,
I am doing some chaos engineering and can pretty reliably replicate a situation where a message in a work queue seems to be stuck, i.e. it's in the queue but not picked up by the queue consumer. Messages before it were all picked up correctly, and newly published messages after it are also being picked up correctly.
To get into this situation I load, say, 100 messages into the queue, then start a JVM process which processes the items (using Kotlin, in a codebase heavily using coroutines, in case it ends up being relevant). This process picks up multiple messages and processes them concurrently. In that process I cause a java.lang.OutOfMemoryError to occur in some thread. In response to this we shut down all running coroutines, NAK the messages that are in progress, and unsubscribe the JetStream subscription. The JVM process runs for a few seconds, processes some of the items, then dies as expected. If I run it a few times it eventually works through the queue, but one or two messages are always left behind. If I then remove the OutOfMemoryError and leave the process up and running, it never picks up the stuck message, no matter how long it's left running.
Because this is simulating a crash of the system, it's possible that there are some race conditions between threads NAKing the messages they're processing and the subscription being unsubscribed. But I'd expect the system to eventually recover and for the message to eventually be made available to some new consumer that starts up. However, this doesn't seem to happen; the message seems to remain stuck in the queue.
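For reference, the processing and shutdown path has roughly this shape (a trimmed sketch rather than our real code, assuming the jnats client with a pull consumer, and hypothetical stream/subject/durable names "WORK" / "work.items" / "worker"):

```kotlin
import io.nats.client.Message
import io.nats.client.Nats
import io.nats.client.PullSubscribeOptions
import kotlinx.coroutines.*
import java.time.Duration

fun main() = runBlocking {
    val nc = Nats.connect("nats://localhost:4222")
    val js = nc.jetStream()

    // Pull subscription bound to the durable work-queue consumer.
    val sub = js.subscribe(
        "work.items",
        PullSubscribeOptions.builder().durable("worker").build()
    )

    val scope = CoroutineScope(SupervisorJob() + Dispatchers.Default)
    try {
        while (isActive) {
            // Fetch a batch and process each message in its own coroutine.
            val batch: List<Message> = sub.fetch(10, Duration.ofSeconds(2))
            batch.map { msg ->
                scope.launch {
                    try {
                        process(msg)   // application work, may throw
                        msg.ack()
                    } catch (t: Throwable) {
                        msg.nak()      // hand the message back for redelivery
                    }
                }
            }.joinAll()
        }
    } finally {
        // Roughly what the OutOfMemoryError handling does: cancel all
        // coroutines (NAKing anything in flight), then drop the subscription.
        scope.cancel()
        sub.unsubscribe()
        nc.close()
    }
}

private suspend fun process(msg: Message) {
    println("processing ${msg.data.decodeToString()}")   // placeholder for real work
}
```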
Does anyone have any ideas how to start troubleshooting this?
Note that if I stop the Java process, remove the consumer, and then start the Java process again (which programmatically re-creates the consumer), it does pick up the "stuck" message.
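For context, the startup code (re)creates that consumer roughly like this (a minimal sketch using jnats' JetStreamManagement, again with hypothetical names; the real configuration has more options):

```kotlin
import io.nats.client.Nats
import io.nats.client.api.AckPolicy
import io.nats.client.api.ConsumerConfiguration
import java.time.Duration

fun main() {
    Nats.connect("nats://localhost:4222").use { nc ->
        val cfg = ConsumerConfiguration.builder()
            .durable("worker")                    // durable work-queue consumer
            .ackPolicy(AckPolicy.Explicit)        // messages must be acked explicitly
            .ackWait(Duration.ofSeconds(30))      // redeliver if not acked in time
            .build()
        // Creates the consumer on the stream, or updates it if it already exists.
        nc.jetStreamManagement().addOrUpdateConsumer("WORK", cfg)
    }
}
```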
I'm using NATS 2.10-alpine, just running in Docker Compose locally at the moment. Below are some commands I ran that show information that might be pertinent:
- `nats stream info`
- `nats consumer info`
- `nats consumer report`
- `nats stream view`
(For the last one I had to stop the service and delete the consumer, otherwise I hit the error described in this issue: nats-io/natscli#920.)
Thanks!