Dispatcher deadlock occurs when the rabbitmq server is down while the dispatcher is waiting for the response #1369
Labels
kind/bug
Categorizes issue or PR as related to a bug.
Comments
Cool story, appreciate your dedication. Deadlocks are certainly the most favorite of all bugs for me :-) I left a small request in the PR. Thanks!
ikavgo pushed commits to rabbitmq/amqp091-go that referenced this issue on Apr 9, 2024
FYI - rabbitmq/amqp091-go#256
Yes I am.
Since the PR is merged, the issue is fixed for me, so I will close this issue.
It is easy to spot this bug if the following 2 criteria are fulfilled.
(Luckily, or unluckily, my testing environment meets both of them.)
Describe the bug
In the beginning, I noticed an issue with the dispatcher. Every time the RabbitMQ server goes down, stays down for a short time, and then comes back up, the dispatcher can't restart consuming. It can reconnect to the RabbitMQ server, but it just can't restart consuming. If you check on the RabbitMQ server side (for example, in the management UI), you will see there is no consumer for the queue.
This issue kept happening in my test env, which drove me to investigate it.
With many debugging logs added, I narrowed down the suspicious code, which is in the file `pkg/dispatcher/dispatcher.go`:

1. The channel registered via `Connection.NotifyClose()` or `Channel.NotifyClose()` becomes readable;
2. `finishConsuming()` gets called, and in turn `wg.Wait()` is called to wait for the dispatch goroutine to finish;
3. `dispatch()`, which is called in the goroutine, gets stuck without returning; `dispatch()` is found to be stuck at the call to `msg.Ack()`.

Even though the connection is lost (`watchRabbitMQConnections()` would create a new connection), according to the doc of `amqp091`, the call to `msg.Ack()` is supposed to return an error to indicate the connection is lost, rather than blocking there. (What makes `msg.Ack()` block?)
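To make that flow easier to follow, here is a minimal, hypothetical sketch of the pattern described above. It is not the actual code in `pkg/dispatcher/dispatcher.go` (the signature of `consumeFromQueue` and its error handling are made up for illustration), but it reuses the identifiers mentioned in this issue, `connNotifyChannel`, `chNotifyChannel`, `finishConsuming`, and `dispatch`, to show how they relate:

```go
package dispatcher

import (
	"sync"

	amqp "github.com/rabbitmq/amqp091-go"
)

// consumeFromQueue mirrors the flow described in this issue: it watches the
// close-notification channels and tears down the worker when one of them fires.
func consumeFromQueue(conn *amqp.Connection, ch *amqp.Channel, msgs <-chan amqp.Delivery) error {
	connNotifyChannel := conn.NotifyClose(make(chan *amqp.Error))
	chNotifyChannel := ch.NotifyClose(make(chan *amqp.Error))

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		dispatch(msgs) // the dispatch goroutine; each delivery is acked inside
	}()

	// finishConsuming waits for the dispatch goroutine to return.
	finishConsuming := func() {
		wg.Wait()
	}

	select {
	case <-connNotifyChannel: // executed when the whole connection goes down
		finishConsuming() // blocks on wg.Wait() if dispatch() never returns
		return amqp.ErrClosed
	case <-chNotifyChannel: // never reached once the connection case has fired
		finishConsuming()
		return amqp.ErrClosed
	}
}

// dispatch handles deliveries; the issue reports msg.Ack blocking forever
// after the broker goes away.
func dispatch(msgs <-chan amqp.Delivery) {
	for msg := range msgs {
		// ... deliver the event to the subscriber ...
		_ = msg.Ack(false)
	}
}
```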
Expected behavior
Every time the RabbitMQ server goes down, stays down for any length of time, and then comes back up, the dispatcher can restart consuming.
To Reproduce
The `NotifyClose` channel becomes readable, and the function `finishConsuming()` is called.
Knative release version
From 1.10.0 to 1.13.0. (I didn't test older versions; I think they might be the same.)
Additional context
In the end, I managed to figure out the root cause, which is a deadlock that blocks `msg.Ack()`. The deadlock occurs this way: the RabbitMQ server, the remote peer, shuts down.
The first deadlock occurs in the `amqp091` library's shutdown path, where the close error is sent to the channels registered by `Connection.NotifyClose()` and `Channel.NotifyClose()`.
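To make the blocking send concrete, here is a small, self-contained Go program. It is a simplified stand-in, not the actual `amqp091` shutdown code: a send on an unbuffered notification channel with no receiver simply never completes.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

func main() {
	// The consumer registered an unbuffered close-notification channel, but,
	// as described above, nothing is receiving from it any more.
	chNotify := make(chan error)

	done := make(chan struct{})
	go func() {
		// Simplified stand-in for the library's shutdown path: it tries to
		// hand the close error to every registered listener.
		chNotify <- errors.New("channel/connection is not open")
		close(done) // never reached: the send above blocks forever
	}()

	select {
	case <-done:
		fmt.Println("shutdown finished")
	case <-time.After(500 * time.Millisecond):
		fmt.Println("shutdown is stuck sending on the unbuffered notify channel")
	}
}
```

Running it always prints the timeout branch, which is exactly the situation the sender in `Channel.shutdown()` ends up in.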
**Sending the error to the `Channel.NotifyClose` channel would block**, because there is no receiver on the `Channel.NotifyClose` channel.

Why is there no receiver on the `Channel.NotifyClose` channel? Also see the piece of code above: the receiver of the `Channel.NotifyClose` channel is the `case <-chNotifyChannel:` branch. It will never be reached, because another receiver, `case <-connNotifyChannel:`, has already been executed, and as a result the process is either blocked in `finishConsuming()` or has returned from the current function.

(By the way, this deadlock blocks one of the sub-goroutines that calls the `shutdown` function. Because it doesn't block the main goroutine, it doesn't look all that harmful.)
The final consequence of this part is that `Channel.shutdown()` couldn't return; because of the deadlock, the mutex taken by `ch.m.Lock()` couldn't be unlocked. So, obviously, this is the second deadlock.
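The two deadlocks can be modelled with a toy type. The `fakeChannel` below is an assumed stand-in for `amqp091`'s `Channel` (the real field names and locking details differ), but it reproduces the chain this issue describes: `shutdown` holds the mutex while it is stuck on the send, so a later `Ack` blocks on the same mutex.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type fakeChannel struct {
	m      sync.Mutex // stands in for ch.m in amqp091
	closes chan error // the unbuffered NotifyClose channel; nobody is reading it
}

// shutdown mimics the library delivering the close error while holding the lock.
func (ch *fakeChannel) shutdown(err error) {
	ch.m.Lock()
	defer ch.m.Unlock()
	ch.closes <- err // blocks forever: no receiver, so the mutex is never released
}

// ack mimics msg.Ack, which has to take the same mutex to write the ack frame.
func (ch *fakeChannel) ack() error {
	ch.m.Lock() // blocks forever: shutdown still holds the lock
	defer ch.m.Unlock()
	return nil
}

func main() {
	ch := &fakeChannel{closes: make(chan error)}

	go ch.shutdown(fmt.Errorf("connection lost")) // first deadlock: blocked send
	time.Sleep(100 * time.Millisecond)            // let shutdown grab the lock

	acked := make(chan struct{})
	go func() {
		_ = ch.ack() // second deadlock: blocked on ch.m.Lock()
		close(acked)
	}()

	select {
	case <-acked:
		fmt.Println("ack returned")
	case <-time.After(500 * time.Millisecond):
		fmt.Println("ack is blocked on the channel mutex held by shutdown")
	}
}
```

The program always prints the timeout branch: neither goroutine can make progress.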
To sum up, the whole deadlock chain is like this:

1. The `amqp091` library sends the error to the unbuffered channels registered by `Connection.NotifyClose()` and `Channel.NotifyClose()`.
2. Once the dispatcher has received from the `Connection.NotifyClose` channel, it won't read the `Channel.NotifyClose` channel any longer, which blocks the sender, `Channel.shutdown()`.
3. `Channel.shutdown()` being blocked makes the mutex taken by `ch.m.Lock()` stay locked.
4. `ch.m.Lock()` staying locked blocks `msg.Ack()`.
5. `msg.Ack()` blocks the dispatch goroutine, so `wg.Done()` will never be called, which blocks `finishConsuming()` at the call to `wg.Wait()`.
6. `finishConsuming()`, in turn, blocks its caller, `ConsumeFromQueue()`, which eventually results in the dispatcher being unable to restart consuming.

By referring to the doc of `amqp091`, I believe the deadlock comes from a kind of improper use of the `NotifyClose` channel; see "Best practises for Connection and Channel notifications" in the `amqp091` documentation.
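For comparison, here is a sketch of the kind of usage that guidance suggests, with buffered `NotifyClose` channels that are drained in dedicated goroutines. This is only an illustration of the general recommendation, not the change made in the linked PR; the AMQP URL and logging are placeholders.

```go
package main

import (
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// Buffered channels: even with no active receiver, one close error can be
	// delivered without blocking the library.
	connClose := conn.NotifyClose(make(chan *amqp.Error, 1))
	chClose := ch.NotifyClose(make(chan *amqp.Error, 1))

	// Dedicated drainers: the notification channels are read until the library
	// closes them on shutdown, so the sender can never get stuck here.
	go func() {
		for err := range connClose {
			log.Printf("connection closed: %v", err)
		}
	}()
	go func() {
		for err := range chClose {
			log.Printf("channel closed: %v", err)
		}
	}()

	// ... consume and ack as usual ...
	select {} // placeholder: keep the process alive for this sketch
}
```

With the notification channels buffered and continuously drained, the library's shutdown path never blocks on its send, so the channel mutex is released and `msg.Ack()` can return an error instead of hanging.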
So I have come up with my own fix. A PR will be created soon.