JsonRpcConnection: Don't drop client from cache prematurely #10210

Open · wants to merge 1 commit into master from endpoint-client-dropped-early

Conversation

@yhabteab (Member) commented Oct 31, 2024

PR #7445 incorrectly assumed that a peer that had already disconnected and never reconnected did so because the endpoint client was dropped after a successful socket shutdown. However, the actual issue at the time was that there was no timeout guard that could cancel the async_shutdown call, which could therefore block indefinitely. Although removing the client from the cache early might have allowed the endpoint to reconnect, it did not resolve the underlying problem. Now that we have a proper cancellation timeout, we can wait until the currently used socket is fully closed before dropping the client from our cache. When our socket termination works reliably, the ApiListener reconnect timer should attempt to reconnect this endpoint on its next tick. Additionally, we now log both before and after socket termination, which may help identify whether it is hanging at any point in between.
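
For illustration only, the reordered flow sketched in Boost.Asio/Icinga style; ShutdownTlsWithTimeout() is a made-up placeholder for the timeout-guarded shutdown, not the actual code:

void JsonRpcConnection::Disconnect()
{
	// Log before touching the socket ...
	Log(LogInformation, "JsonRpcConnection")
		<< "Disconnecting API client for identity '" << m_Identity << "'";

	// ... terminate the socket, bounded by the cancellation timeout ...
	ShutdownTlsWithTimeout(m_Stream); // placeholder: cannot block indefinitely anymore

	// ... and only then drop the client from the cache, so the ApiListener
	// reconnect timer may dial this endpoint again on its next tick.
	if (m_Endpoint) {
		m_Endpoint->RemoveClient(this);
	}

	// Seeing this line in the log proves the shutdown actually completed.
	Log(LogWarning, "JsonRpcConnection")
		<< "API client disconnected for identity '" << m_Identity << "'";
}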

Tests

[2024-10-31 10:14:11 +0100] information/JsonRpcConnection: Disconnecting API client for identity 'satellite'
[2024-10-31 10:14:11 +0100] warning/ApiListener: Removing API client for endpoint 'satellite'. 0 API clients left.
[2024-10-31 10:14:11 +0100] warning/ApiListener: Error while replaying log for endpoint 'satellite': Error: Cannot send message to already disconnected API client 'satellite'!

Context:
        (0) Replaying log for Endpoint 'satellite'

[2024-10-31 10:14:11 +0100] warning/JsonRpcConnection: API client disconnected for identity 'satellite'

@yhabteab added the 'enhancement' and 'area/api' labels Oct 31, 2024
@yhabteab added this to the 2.15.0 milestone Oct 31, 2024
@cla-bot added the 'cla/signed' label Oct 31, 2024
@icinga-probot added the 'area/distributed', 'bug', and 'core/build-fix' labels Oct 31, 2024
@yhabteab removed the 'bug' and 'core/build-fix' labels Oct 31, 2024
@Al2Klimov (Member) left a comment

Do we actually need two warnings?

@yhabteab (Member Author)

Actually, no. However, I don't know which log level to use for which of these. So, if you have any better ideas, then please suggest!

@yhabteab force-pushed the endpoint-client-dropped-early branch from 99be923 to d51d6a7 on October 31, 2024 11:27
@yhabteab (Member Author)

Do we actually need two warnings?

I have now downgraded the first log to info.

@yhabteab added the 'consider backporting' label Oct 31, 2024
@Al2Klimov (Member) left a comment

Now that I'm thinking about it: in monitoring in general, you use WARNING as an early indicator (before CRITICAL) that something may not be right. But let's also let Julian have his say...

@julianbrost (Contributor)

I'm not really sure how well that comparison works. But yes, the question for choosing the log severity should also be "does it require attention?". So during a normal reload/config deployment, there ideally shouldn't be any warnings (I do know that we aren't there yet).

For a user, having both messages doesn't sound too helpful: they say pretty much the same thing twice within at most 10 seconds, so I'd go even further and log the first one at notice.

Additionally, we now log both before and after socket termination, which may help identify whether it is hanging at any point in between.

That makes it sound like something that should only be necessary to debug very specific issues, not something that would be useful logging for every user.

		ApiListener::GetInstance()->RemoveAnonymousClient(this);
	}

	Log(LogNotice, "JsonRpcConnection")
		<< "Disconnecting API client for identity '" << m_Identity << "'";
Al2Klimov (Member):

💡 What about logging this...

@@ -208,18 +208,8 @@ void JsonRpcConnection::Disconnect()
	JsonRpcConnection::Ptr keepAlive (this);
Al2Klimov (Member):

... ASAP?

yhabteab (Member Author):

I don't think that would be useful. We're only concerned with whether the I/O threads are executing the coroutines correctly.

Al2Klimov (Member):

And that's why I think we should, while on it, log this even before the coroutine spawns.

PR #7445 incorrectly assumed that a peer that had already disconnected
and never reconnected did so because the endpoint client was dropped
after a successful socket shutdown. However, the issue at that time was
that there was no timeout guard that could cancel the `async_shutdown`
call, potentially blocking indefinitely. Although removing the client from
cache early might have allowed the endpoint to reconnect, it did not
resolve the underlying problem. Now that we have a proper cancellation
timeout, we can wait until the currently used socket is fully closed
before dropping the client from our cache. When our socket termination
works reliably, the `ApiListener` reconnect timer should attempt to
reconnect this endpoint after the next tick. Additionally, we now have
logs both before and after socket termination, which may help
identify if it is hanging somewhere in between.
Al2Klimov previously approved these changes Nov 4, 2024
@Al2Klimov (Member) left a comment

👍

@julianbrost (Contributor)

Neither the PR description nor the commit message really tell the purpose of moving the RemoveClient() call.

it did not resolve the underlying problem.

Here, underlying problem refers to the possibly blocking TLS shutdown without a timeout, i.e. something that's fixed already and not supposed to be fixed by this PR?

Now that we have a proper cancellation timeout, we can wait until the currently used socket is fully closed before dropping the client from our cache.

Why do we want to wait? Like the PR claims that currently it's premature but not why that's the case and what's improved by changing this.

@yhabteab (Member Author) commented Nov 6, 2024

Here, underlying problem refers to the possibly blocking TLS shutdown without a timeout, i.e. something that's fixed already and not supposed to be fixed by this PR?

Something that should have been fixed already!

Now that we have a proper cancellation timeout, we can wait until the currently used socket is fully closed before dropping the client from our cache.

Why do we want to wait? Like the PR claims that currently it's premature but not why that's the case and what's improved by changing this.

Firstly, you don't want to mark an endpoint as disconnected if it actually isn't. Before #7445, the shutdown flow was as it should be, i.e. first the socket is completely shut down, then the endpoint is marked as disconnected. However, this was changed with #7445 due to the incorrect assumption that disconnected clients never reconnecting again was caused by this order, when in fact the shutdown might have been stuck somewhere in async_shutdown before reaching the end. So if for some reason the shutdown process got stuck somewhere in between, we would never know, but would assume that the node was completely disconnected, as the API listener would then try to reconnect to the client before fully ensuring that the previous socket was properly closed. This PR reverts this to the original form, so that if we see an API client disconnected ... log entry, we know exactly that the shutdown process was successful.

@julianbrost (Contributor)

Firstly, you don't want to mark an endpoint as disconnected if it actually isn't.

In contrast, do you want to keep treating an endpoint as connected after the code decided that the connection is dead (for example in the "no messages received" case)? It's more in some in-between state of cleaning up the connection.

So if for some reason the shutdown process gets stuck somewhere in between, we would never know, but assume that the node was completely disconnected, as the API listener would then try to reconnect to the client before fully ensuring that the previous socket was properly closed.

In itself, that doesn't sound bad. Like once the connection is declared dead, it should be fine to establish a new one. It's just problematic if that would turn into a resource leak. Is that the main change here, that if the code had a problem in Disconnect(), this PR would turn an invisible resource leak into a more visible "it fails to reconnect because it somehow hangs in Disconnect()"?

This PR reverts this to the original form, so that if we see an API client disconnected ... log entry, we know exactly that the shutdown process was successful.

Though that's not related to moving the RemoveClient() call but simply due to the change to the logging.

Interestingly, there seems to be quite a connection to #10005, so this might be yet another reason to revive that PR, quoting from that PR's description:

Open questions:

  • Do we want to try to call ForceDisconnect() directly in case a connection is shut down due to a timeout like "no messages received" on JSON-RPC connections?

Now thinking about this in context, my intuition says yes. If we consider a connection failed, why should we bother attempting a clean TLS shutdown instead of just killing it with fire? That would then also change things here as a (forceful) disconnect should be pretty much instant.
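
A minimal sketch of such a forceful disconnect, assuming m_Stream exposes the underlying TCP socket via lowest_layer() (as Icinga 2's AsioTlsStream does); no TLS close_notify is exchanged, so it returns practically instantly:

boost::system::error_code ec;

// Skip the TLS shutdown handshake entirely and tear down the TCP socket.
// This has to go through Asio rather than a raw close(2) on the fd, so that
// pending async operations complete with an error instead of leaking.
m_Stream->lowest_layer().shutdown(boost::asio::ip::tcp::socket::shutdown_both, ec);
m_Stream->lowest_layer().close(ec);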

@yhabteab (Member Author) commented Nov 6, 2024

In contrast, do you want to keep treating an endpoint as connected after the code decided that the connection is dead (for example in the "no messages received" case)?

Yes, generally I would treat an endpoint as connected as long as its socket is not fully shut down, but the no message received case is something different that ideally should not happen that often.

Is that the main change here, that if the code had a problem in Disconnect(), this PR would turn an invisible resource leak into a more visible "it fails to reconnect because it somehow hangs in Disconnect()"?

Yes. If you just drop the client before anything else and log that it's disconnected when in fact it's not, we won't be able to tell for sure if the shutdown is complete afterwards. Unless there's something that we really messed up, the endpoint should be kept as connected for a maximum of 10s after someone requests a disconnect, so it shouldn't cause any other issues.

This PR reverts this to the original form, so that if we see an API client disconnected ... log entry, we know exactly that the shutdown process was successful.

Though that's not related to moving the RemoveClient() call but simply due to the change to the logging.

When looking just at the log entry, then yes, but why should the endpoint be allowed to initiate another connection at all if the current one is not completely closed? I'm just trying to recreate the (at least for me) logical flow of how it should be, i.e. first either gracefully or forcibly close the current connection before marking the endpoint as disconnected.

Now thinking about this in context, my intuition says yes. If we consider a connection failed, why should we bother attempting a clean TLS shutdown instead of just killing it with fire?

As I have already talked to you lately about that PR, I'm perfectly happy with it regardless of this one, and forcibly closing such a dead connection doesn't sound like a bad idea either.

That would then also change things here as a (forceful) disconnect should be pretty much instant.

What exactly would change here otherwise? For me, the referenced PR is just something I would include on top of this, but I don't see how those exclude one another.

@julianbrost (Contributor)

This PR reverts this to the original form, so that if we see an API client disconnected ... log entry, we know exactly that the shutdown process was successful.

Though that's not related to moving the RemoveClient() call but simply due to the change to the logging.

When looking just at the log entry, then yes, but why should the endpoint be allowed to initiate another connection at all if the current one is not completely closed? I'm just trying to recreate the (at least for me) logical flow of how it should be, i.e. first either gracefully or forcibly close the current connection before marking the endpoint as disconnected.

That's what I meant with saying it's in some in-between state. The rest of the code also makes a difference between connected and connecting; likewise, there's a difference between disconnecting and disconnected (though I don't think that distinction is made explicitly in the code).
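
Sketched as an enum purely for illustration (Icinga 2 defines no such type; the names are made up here):

enum class ConnectionState {
	Connecting,    // TCP/TLS handshake still in progress
	Connected,     // fully established, messages flowing
	Disconnecting, // declared dead, shutdown/cleanup still running
	Disconnected   // socket fully closed, client dropped from the cache
};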

Now the question is whether a new connection should already be attempted while in that disconnecting state. Currently it is; with this PR, you suggest waiting until the endpoint is fully disconnected. I think neither is wrong and would probably tend towards the suggested change. You just use quite a high level of "should" here, and whether the change is a good idea boils down to how sure we are that this doesn't delay the RemoveClient() call by much more than 10 seconds.

That would then also change things here as a (forceful) disconnect should be pretty much instant.

What exactly would change here otherwise? For me, the referenced PR is just something I would include on top of this, but I don't see how those exclude one another.

Ideally, that forceful disconnect would only be a close syscall [1], so then it wouldn't make a noticeable difference whether you call RemoveClient() before or after that (in contrast to waiting 10 seconds for something to happen).

Footnotes

  1. Though that can't be done manually, Asio needs to be aware of this, so it has to be done using Asio.

@Al2Klimov (Member)

Ideally, that forceful disconnect would only be a close syscall [1], so then it wouldn't make a noticeable difference whether you call RemoveClient() before or after that (in contrast to waiting 10 seconds for something to happen).

Footnotes

  1. Though that can't be done manually, Asio needs to be aware of this, so it has to be done using Asio.

Given this quote of yours, does anything speak against not doing anything and letting the stream destructor do its thing?

@julianbrost (Contributor)

Given this quote of yours, does anythings speak against not doing anything and letting the stream destructor do its thing?

If the destructor does just the right thing, that could be the "Asio needs to be aware of this, so it has to be done using Asio." part.

@julianbrost (Contributor)

whether the change is a good idea boils down to how sure we are that this doesn't delay the RemoveClient() call by much more than 10 seconds.

Upon closer inspection, there still seem to be things that could block the disconnect.

This PR also moves the RemoveClient() call across this line:

m_WriterDone.Wait(yc);

That basically waits for the WriteOutgoingMessages() method to terminate (as it's set in a Defer):

void JsonRpcConnection::WriteOutgoingMessages(boost::asio::yield_context yc)
{
	Defer signalWriterDone ([this]() { m_WriterDone.Set(); });

	do {
		m_OutgoingMessagesQueued.Wait(yc);

		auto queue (std::move(m_OutgoingMessagesQueue));

		m_OutgoingMessagesQueue.clear();
		m_OutgoingMessagesQueued.Clear();

		if (!queue.empty()) {
			try {
				for (auto& message : queue) {
					size_t bytesSent = JsonRpc::SendRawMessage(m_Stream, message, yc);

					if (m_Endpoint) {
						m_Endpoint->AddMessageSent(bytesSent);
					}
				}

				m_Stream->async_flush(yc);
			} catch (const std::exception& ex) {
				Log(m_ShuttingDown ? LogDebug : LogWarning, "JsonRpcConnection")
					<< "Error while sending JSON-RPC message for identity '"
					<< m_Identity << "'\n" << DiagnosticInformation(ex);

				break;
			}
		}
	} while (!m_ShuttingDown);

	Disconnect();
}

That means that this PR changes the behavior so that a new connection is only attempted after that JsonRpc::SendRawMessage() and async_flush() finished, which might not happen for a long time on a dead connection.

There is #10216 which in combination with this PR would reduce it down to waiting for sending a single message or waiting for the async_flush(), but that can take too long on its own.

@julianbrost (Contributor) left a comment

See #10210 (comment) (I should have submitted that as a review in the first place)

@Al2Klimov (Member)

Yes, but I'm wondering whether a m_Stream->lowest_layer().cancel(ec) could give the write loop the necessary kick in a such situation.
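
Sketched out (error code deliberately ignored), that kick is a two-liner: cancel() on the underlying socket completes any pending async operation with operation_aborted, which the catch block in the write loop above turns into a break:

boost::system::error_code ec;

// Abort the async read/write currently pending on the socket; the blocked
// JsonRpc::SendRawMessage()/async_flush() then fail instead of waiting on a
// dead peer, and m_WriterDone gets set via the Defer.
m_Stream->lowest_layer().cancel(ec);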

@yhabteab (Member Author) commented Nov 7, 2024

You just use quite a high level of "should" here, and whether the change is a good idea boils down to how sure we are that this doesn't delay the RemoveClient() call by much more than 10 seconds.

I only used a high level of should because I can't tell you for sure that it will always be that way in every production system, because I can't and neither can you. I'm just saying that if there's no other hidden bug waiting to pop up, it definitely should take a maximum of 10s to fully disconnect. If that's not the case, we will need to dig for the actual problem, but claiming an endpoint is disconnected when it really isn't is not a solution you can live with.

To prove that PR #7445 did not solve the real problem, I just did a test where async_shutdown literally hangs indefinitely until a global timeout of 1h destroys the entire session and terminates the process. You don't have to ask for the test code I used, I'm just saying that async_shutdown would hang forever without the shutdown timeout and with it only for the specified time.

09:03:39: The SSL stream's async_shutdown method is hanging, and I didn't interrupt it
...
10:03:22 Exception thrown in coroutine - Thread stopped
Server::~Server()
Session::~Session()

So, merge or close it, it's entirely up to you!

@Al2Klimov self-requested a review November 7, 2024 10:59
@yhabteab (Member Author) commented Nov 7, 2024

Sorry! I didn't see your intermediate comments while submitting my previous comment.

There is #10216 which in combination with this PR would reduce it down to waiting for sending a single message or waiting for the async_flush(), but that can take too long on its own.

And how long is the "too long" you are referring to? Why do you expect an endpoint to be instantly flagged as disconnected as soon as someone has requested a disconnect? If the current connection is still alive, we obviously need to flush its buffers first before closing it, and to me that's just the natural flow. But what advantages do you hope to gain from immediately cancelling the connection? To initiate a new connection while the current one is still sending data? And for what purpose? I don't get that.

I've already said in my previous comments that the no messages received disconnect is a special case that doesn't actually need a graceful shutdown; such a connection can simply be terminated, as you've already suggested.

@julianbrost (Contributor)

And how long is the "too long" you are referring to?

That's the thing, there's no explicit timeout set. So however long it takes until the kernel decides itself that the socket is dead (which can take hours).

Why do you expect an endpoint to be instantly flagged as disconnected as soon as someone has requested a disconnect? If the current connection is still alive, we obviously need to flush its buffers first before closing it, and to me that's just the natural flow. But what advantages do you hope to gain from immediately cancelling the connection? To initiate a new connection while the current one is still sending data? And for what purpose? I don't get that.

I don't want it, I'm just saying there's a high chance that this PR actually reintroduces the problem described in #7444 (hence the "request changes" review, so that it doesn't get merged before that is investigated, as it was already approved).

I've already said in my previous comments that the no messages received disconnect is a special case that doesn't actually need a graceful shutdown; such a connection can simply be terminated, as you've already suggested.

Not every disconnect is for this reason, but if there is one for this reason, reconnecting afterwards should work reliably.

@julianbrost (Contributor)

I'm just saying there's a high chance that this PR actually reintroduces the problem described in #7444

Indeed it does. I managed to trigger it with this PR by breaking a TCP connection by dropping all of its packets in a firewall (iptables -A INPUT -p tcp -s 172.18.0.33 --sport 49584 -j DROP).

Both nodes then logged that no messages were received in the last 60 seconds but didn't attempt to reconnect until 15 minutes later (not sure which timeout caused it to unblock, I'm not aware of related 15 minute timeouts in Icinga 2, so it probably was some kernel timeout).

[2024-11-07 13:20:46 +0100] information/JsonRpcConnection: No messages for identity 'satellite-b-2' have been received in the last 60 seconds.
[2024-11-07 13:35:24 +0100] warning/ApiListener: Removing API client for endpoint 'satellite-b-2'. 0 API clients left.
[...just work queue statistics and dumping program state...]
[2024-11-07 13:35:24 +0100] warning/JsonRpcConnection: API client disconnected for identity 'satellite-b-2'
[2024-11-07 13:35:26 +0100] information/ApiListener: Reconnecting to endpoint 'satellite-b-2' via host 'satellite-b-2' and port '5665'

Is that the main change here, that if the code had a problem in Disconnect(), this PR would turn an invisible resource leak into a more visible "it fails to reconnect because it somehow hangs in Disconnect()"?

So yes, it does exactly that, see also #10005 (comment) (describes the resource leak that happens when doing the same on the current master) 😅


@Al2Klimov dismissed their stale review November 7, 2024 16:46

Indeed. Too many connections are bad. But none at all, "just" because of a possibly small resource leak, is worse IMAO. Let's stall this.

@julianbrost added the 'stalled' label Nov 12, 2024