This repository has been archived by the owner on Sep 6, 2018. It is now read-only.

Use exponential backoff for failed peer heartbeats. #193

Open
wants to merge 1 commit into master

Conversation


@macb macb commented Feb 27, 2014

Noticed @xiangli-cmu mention exponential back-off for peer heartbeats in etcd-io/etcd#595 and thought it might make a good first attempt to contribute.

Feedback would be greatly appreciated.

@ongardie

If indeed the backoff is desirable, have you considered placing a limit on the timeout? The concern I have is that a server could be down for arbitrary amounts of time, sending its timeout through the roof. Then, when it came back, it'd be ignored for an unnecessary period of time.

@macb
Author

macb commented Feb 28, 2014

I was thinking about that, but I wasn't sure what the limit should be, since any value would be arbitrary.

On Thu, Feb 27, 2014 at 7:45 PM, Diego Ongaro wrote:

> If indeed the backoff is desirable, have you considered placing a limit on the timeout? The concern I have is that a server could be down for arbitrary amounts of time, sending its timeout through the roof. Then, when it came back, it'd be ignored for an unnecessary period of time.


@philips
Member

philips commented Feb 28, 2014

@macb In etcd-io/etcd#595 I meant more that the logging should back off exponentially. The backoff on this side, if we add any, should be capped at a second or two.

@macb
Author

macb commented Feb 28, 2014

@philips understandable. I had originally looked into logging backoff for failed heartbeats but didn't see a neat way to approach that. @xiangli-cmu had mentioned heartbeat probing back-off as well and it seemed like it'd kill two birds with one stone.

A limit definitely makes sense, but I didn't want to do much else without getting feedback from more involved devs.

@xiang90
Contributor

xiang90 commented Feb 28, 2014

@macb

  1. We can do back-off probing with limited growth (on the order of seconds).
  2. We can have a node actively request a heartbeat if it has lost its connection with the leader and is approaching its election timeout, or if it has just restarted.

We are re-writing the heartbeat function. I think we can just leave this pull request here for now.
