JetStream consumer does not resume receiving messages after successful reconnect #1729

njkleiner · 2024-10-21T16:18:04Z

Observed behavior

I am experiencing a bug where a JetStream consumer will sometimes become "stuck" after a reconnect.

Concretely, a consumer will sometimes not resume receiving messages and instead reach a state
where it permanently throws "no heartbeat" errors after a successful reconnect to the server.

Expected behavior

The JetStream consumer should resume receiving messages after a successful reconnect (which it sometimes does).

Server and client version

I am using the main branch of nats.go as of commit c7cf3452dd6359bdf40cbad0c39d900cbeba81e2.

Host environment

I am running these tests using Go 1.23.2 (darwin/arm64), with the asynctimerchan=1 GODEBUG setting enabled.

I am using the Consumer.Consume method and the following ConsumerConfig:

jetstream.ConsumerConfig{
	Durable: "",

	AckPolicy:     jetstream.AckExplicitPolicy,
	DeliverPolicy: jetstream.DeliverAllPolicy,
}

Steps to reproduce

After a lot of manual debugging, I believe the issue is a race condition in the core NATS code,
where a call to Subscription.pCond.Wait will block forever, because no subsequent call to pCond.Signal or pCond.Broadcast ever occurs in spite of the interrupted connection having been successfully reconnected.

I have attached four example logs: two that demonstrate a "stuck" consumer and two that demonstrate a consumer that is not "stuck".
See the attached patch for the context of the debug messages in these logs.

stuck2.log
notstuck2.log
stuck.log
notstuck.log

0001-add-debug-messages.patch

piotrpio · 2024-10-25T08:37:33Z

Hello @njkleiner, thanks for creating the issue. I'll look at this, but in the meantime could you please check whether your consumer still exists after the reconnect (at the point at which Consume is stuck)? You're using an ephemeral consumer (Durable is empty) and you don't explicitly set InactiveThreshold in your consumer config, so the server uses the default, which is 5 seconds. So if the client is disconnected and does not send pull requests for at least 5 seconds, the consumer will be automatically deleted.

I'm obviously not saying that's the case and I'll be looking at this regardless, but this is a pretty common case so would be nice if you could check.
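
For illustration, here is a minimal sketch of the config from the report with InactiveThreshold set explicitly. The two-minute value and the helper name ephemeralConfig are assumptions chosen for the example, not recommendations; pick a threshold longer than the longest disconnect you expect to ride out.

```go
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go/jetstream"
)

// ephemeralConfig returns the ConsumerConfig from the report with an explicit
// InactiveThreshold. The helper itself is hypothetical; only the struct fields
// come from the jetstream package.
func ephemeralConfig(threshold time.Duration) jetstream.ConsumerConfig {
	return jetstream.ConsumerConfig{
		Durable: "", // still an ephemeral consumer

		AckPolicy:     jetstream.AckExplicitPolicy,
		DeliverPolicy: jetstream.DeliverAllPolicy,

		// How long the server keeps the consumer without any activity from
		// the client before deleting it.
		InactiveThreshold: threshold,
	}
}

func main() {
	// Assumed value for the example only.
	cfg := ephemeralConfig(2 * time.Minute)
	fmt.Println(cfg.InactiveThreshold)
}
```

A longer threshold only widens the window; a disconnect that outlasts it would still leave the client with a deleted consumer.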

njkleiner · 2024-10-25T09:06:09Z

Sure, I can take a look.

But if this turns out to be the case, I would still argue that the client should not continue blocking indefinitely -- upon a successful reconnect -- when a consumer is deleted server side (and instead return an error immediately, on reconnect).

piotrpio · 2024-10-25T10:03:49Z

@njkleiner I agree with you, but there is not much we can do to make it better. We do not have a way of knowing the consumer has been deleted unless it happened during an active pull request, which results in us getting a Consumer Deleted status. But in case of reconnect we have no way of knowing unless we ask for consumer info on each reconnect, which is quite a heavy operation.

derekcollison · 2024-10-25T16:50:43Z

If the consumer does heartbeats the client will detect it is gone eventually.

njkleiner · 2024-10-25T17:23:29Z

I haven't had time to take a detailed look yet, but it appears that both of the example logs I provided where the consumer gets "stuck" represent cases where the disconnect lasted at least five seconds.

So you might be right about the consumer being deleted on the server side @piotrpio, I will investigate further next week.

@derekcollison I am not entirely sure how to interpret your comment. As I have originally stated, a "stuck" consumer eventually enters an indefinite state of heartbeat errors.

So the client detected it to be "gone" in that sense, but I have not treated these errors as unrecoverable so far -- my logic was that, if I assume an unstable connection where reconnects may occur, heartbeat timeouts are expected as well, and that successful reconnects would imply that heartbeat timeouts may, in principle, be recovered eventually as well.

Are you saying that indefinite heartbeat errors as described are indicative of a consumer that was deleted server side? And, if so, is there a way to distinguish these heartbeat errors from other (recoverable) heartbeat errors on the client side?

derekcollison · 2024-10-25T19:32:11Z

I think if the system can delete consumers out from underneath of apps, then the heartbeat should be required and iff the heartbeat fails, do a consumer info to determine if the consumer still exists, and if not take appropriate action to resolve.

njkleiner added the "defect" label (suspected defect such as a bug or regression) Oct 21, 2024

piotrpio self-assigned this Oct 25, 2024
