Actually disconnect on timeout #172
base: master
Conversation
@Drahflow can you improve your commit description? Sorry, but I can't understand it as written. Maybe structure it as: "What is going on?", "What does this lead to?", "How is it fixed?".
`lymph.core.connection.Connection` pings the endpoint regularly to determine whether it is still live. `__init__` takes an `unresponsive_disconnect` parameter, which was ignored so far. Similarly, the `idle_disconnect` parameter did not have any effect. This commit uses these two parameters to do what they advertise, i.e. close the connection once the respective timeouts have been reached. To this end, the greenlet management logic was also changed, so that the monitoring greenlets now exit on their own once the connection `status == CLOSED` instead of being explicitly killed from `close()`. Without this change, the monitoring greenlet would inadvertently kill itself, leading to (very non-obvious) dead code and leaving a half-dead connection object still registered with the server.
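A minimal sketch of the greenlet lifecycle change described above, assuming a gevent-style connection object; the class layout, status constants, and method names here are illustrative stand-ins, not code from lymph:

```python
import time

import gevent

# Status constants are assumptions for this sketch, not lymph's actual values.
LIVE, UNRESPONSIVE, CLOSED = 'live', 'unresponsive', 'closed'


class Connection(object):
    """Hypothetical, simplified stand-in for lymph.core.connection.Connection."""

    def __init__(self, endpoint, timeout=3, unresponsive_disconnect=30, idle_disconnect=60):
        self.endpoint = endpoint
        self.timeout = timeout
        self.unresponsive_disconnect = unresponsive_disconnect
        self.idle_disconnect = idle_disconnect  # idle handling omitted below for brevity
        self.status = LIVE
        self.last_seen = time.monotonic()
        # The monitoring greenlet checks the connection status itself ...
        self.monitor = gevent.spawn(self.live_check_loop)

    def live_check_loop(self):
        # ... and exits on its own once the connection is CLOSED,
        # instead of being kill()ed from close().
        while self.status != CLOSED:
            self.update_status()
            gevent.sleep(1)

    def update_status(self):
        elapsed = time.monotonic() - self.last_seen
        if elapsed >= self.unresponsive_disconnect:
            self.close()  # give up on the endpoint entirely
        elif elapsed >= self.timeout:
            self.status = UNRESPONSIVE

    def on_pong(self):
        self.last_seen = time.monotonic()
        self.status = LIVE

    def close(self):
        # No self.monitor.kill() here: if close() happens to run inside the
        # monitoring greenlet, killing it would abort close() halfway through.
        self.status = CLOSED
```

The point of the change is visible in `close()`: because the loop watches `status` itself, `close()` no longer has to kill a greenlet that may currently be the one executing it.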
Force-pushed from cf3e737 to 3584433
This makes sense to me in general (it's basically what …
```diff
@@ -84,8 +84,14 @@ def live_check_loop(self):
     def update_status(self):
         if self.last_seen:
             now = time.monotonic()
             if now - self.last_seen >= self.timeout:
+            if now - self.last_seen >= self.unresponsive_disconnect:
```
This timeout should only apply once the connection status is UNRESPONSIVE.
We should track the time since the last status change for disconnect timeouts.
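A rough sketch of that suggestion, written as a drop-in variant of `update_status()` for the hypothetical `Connection` sketched earlier; the `status_changed_at` attribute and `set_status()` helper are invented here to illustrate the idea and do not come from the patch:

```python
def update_status(self):
    now = time.monotonic()
    if self.status == UNRESPONSIVE:
        # Disconnect timeout counted from the moment the status changed,
        # not from the last successful ping.
        if now - self.status_changed_at >= self.unresponsive_disconnect:
            self.close()
    elif self.last_seen and now - self.last_seen >= self.timeout:
        self.set_status(UNRESPONSIVE)

def set_status(self, status):
    # Record when the status last changed, so disconnect timeouts
    # can be measured against it.
    if status != self.status:
        self.status = status
        self.status_changed_at = time.monotonic()
```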
Do we win anything by the status age tracking, except that the timeouts would be … Philosophically speaking, it IMHO makes more sense to have the *_disconnect parameters denote the time since unresponsiveness / idleness began (which is indeed before the status change). Since you designed this, I'll follow your suggestions unless I hear differently from you.
Force-pushed from c9391d5 to b4d5b3a
@emulbreh @Drahflow Why are we doing this? Removing a connection when it's idle is very weird to me, i.e. if we don't talk to a given instance for more than 30 seconds we don't need the connection? Is this just some optimisation thing?

And the main question: how does this fit with the zookeeper heartbeat? Aren't we already disconnecting from an instance when the latter goes down or its process is unresponsive? Besides, how do we define unresponsiveness inside lymph? I don't think we add anything on top of the zookeeper heartbeat.

Beside this heartbeat thing, the more I think about it the more I dislike it, since it assumes a lot about the network where the services run. E.g. the timeout is 3 seconds, so in case of a latency burst we will mark connections as unresponsive, which is false!
I personally don't care for disconnect on idle; it probably doesn't hurt, nor help much. I just implemented the logic for the already existing constructor parameters. However, I do care a lot about disconnect on unresponsiveness:
…
What do you mean by "Not all of them are covered by sending zookeeper heartbeats"? Can you elaborate? I don't see how our heartbeat algorithm is in any way more sophisticated than the zookeeper heartbeat. AFAIK the heartbeat is used to check process liveness and not to check whether there is a latency burst, because the latter is something that will happen, and you don't want to just stop and fail all requests just because your network is having a bad moment. Thoughts?
Re 1.: It's the other way around. We have a usual zookeeper-registered instance A. Then I do …

Re 2.: Nothing against the circuit breaker. But the breaker also needs some event to trigger on. Do you propose to use the proxy timeouts for this?

What I meant by "not everything is covered by zookeeper": I think we are sending regular zookeeper heartbeats via some greenlet or whatever. However, this effectively only detects whether our entire process or machine is overloaded, not whether we are experiencing unacceptable latency on a connection. (The former implies the latter, but not the other way around.)

Consider a request which takes, say, 100ms to serve (instead of the expected 1ms). Now the client code sends 1000 of them into the same connection, followed by a ping. After 1 sec, the service will have completed 1% of the requests, and will still send a zookeeper heartbeat. Only after 100 sec is the service ready to be useful again. Combine this with asynchronous requests on the client and I think the next request of the client should go somewhere else, if possible.

Same problem if something ugly happened on the TCP level, like a network outage only between the client and the service, but not between each of them and zookeeper.
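Spelling out the numbers in that example (assuming, as the comment does, that the 1000 queued requests are served one after another on the connection; the 1 second heartbeat interval is also an assumption):

```python
# Back-of-the-envelope arithmetic for the backlog example above.
service_time = 0.100       # seconds per request while degraded (expected: 0.001)
queued_requests = 1000     # requests the client pushed into the same connection
heartbeat_interval = 1.0   # assumed seconds between zookeeper heartbeats

drain_time = queued_requests * service_time               # 100.0 s until the ping gets answered
served_per_heartbeat = heartbeat_interval / service_time  # 10 requests, i.e. 1% of the backlog

print(drain_time, served_per_heartbeat / queued_requests)  # -> 100.0 0.01
```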
Yes, kazoo does the heartbeat in another greenlet, and your calculations, if I understand them correctly, miss the fact that requests run concurrently, as do the heartbeats of both kazoo and the lymph connections, since each runs in its own greenlet. Beside that, I couldn't follow your math very well, too many numbers for my taste :). As for a network outage, I don't think it applies here, since any real setup should make sure that zookeeper and the instances live in the same network; if there is a problem with communication between an instance and zookeeper, then the instance is already dead from all points of view.
`lymph.core.connection.Connection` pings the endpoint regularly to determine whether it is still live. `__init__` takes an `unresponsive_disconnect` parameter, which was ignored so far. Similarly, the `idle_disconnect` parameter did not have any effect.

This commit uses these two parameters to do what they advertise, i.e. close the connection once the respective timeouts have been reached.

To this end, the greenlet management logic was also changed, so that the monitoring greenlets now exit on their own once the connection `status == CLOSED` instead of being explicitly killed from `close()`. Without this change, the monitoring greenlet would inadvertently kill itself, leading to (very non-obvious) dead code and leaving a half-dead connection object still registered with the server.

(edited as requested by @mouadino)