Use exponential backoff for failed peer heartbeats. #193

macb · 2014-02-27T09:22:10Z

Noticed @xiangli-cmu mention exponential back-off for peer heartbeats in etcd-io/etcd#595 and thought it might make a good first attempt to contribute.

Feedback would be greatly appreciated.

ongardie · 2014-02-28T03:45:52Z

If indeed the backoff is desirable, have you considered placing a limit on the timeout? The concern I have is that a server could be down for arbitrary amounts of time, sending its timeout through the roof. Then, when it came back, it'd be ignored for an unnecessary period of time.

macb · 2014-02-28T03:47:56Z

I was thinking about that but wasn't sure what the arbitrary limit should be.

philips · 2014-02-28T03:51:33Z

@macb In etcd-io/etcd#595 I meant that the logging should back off exponentially. The backoff on this side, if we add any, should be capped at a second or two.

macb · 2014-02-28T04:13:22Z

@philips understandable. I had originally looked into logging backoff for failed heartbeats but didn't see a neat way to approach that. @xiangli-cmu had mentioned heartbeat probing back-off as well and it seemed like it'd kill two birds with one stone.

A limit definitely makes sense, but I didn't want to do much else without getting feedback from more involved devs.
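
For the logging side of the discussion, one possible approach is to log only when the count of consecutive failures is a power of two, so messages thin out exponentially while heartbeats keep their normal schedule. This is a sketch under that assumption, not something proposed in the thread; `shouldLog` and the log message are hypothetical.

```go
package main

import "fmt"

// shouldLog reports whether the n-th consecutive failure (1-based) should be
// logged. It returns true only when n is a power of two: 1, 2, 4, 8, ...
// The name is hypothetical and used for illustration only.
func shouldLog(n uint) bool {
	return n > 0 && n&(n-1) == 0
}

func main() {
	// Simulate 20 consecutive heartbeat failures to one peer.
	for n := uint(1); n <= 20; n++ {
		if shouldLog(n) {
			fmt.Printf("heartbeat to peer failed (%d consecutive failures)\n", n)
		}
	}
}
```

The failure counter would be reset to zero on the first successful heartbeat, so a later outage is logged from its first failure again.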

xiang90 · 2014-02-28T04:21:48Z

@macb

We can do back-off probing with limited growth (capped at seconds).
We can have a node make an active request if it has lost its connection with the leader and is approaching its election timeout, or if it has just restarted.

We are rewriting the heartbeat function, so I think we can just leave this pull request here for now.

Use exponential backoff for failed peer heartbeats.

bec213f
