etcd gets into bad state -- snapshots fail until etcd process is restarted #1496
Comments
jstamerj changed the title from "etcd gets into bad state and fails to snapshot until after etcd process is restarted" to "etcd gets into bad state -- snapshots fail until etcd process is restarted" on Oct 29, 2014
@jstamerj Thanks for reporting this issue. I'm going to try and reproduce this for the core team, just to quickly validate your findings.
jstamerj pushed a commit to HBOCodeLabs/etcd that referenced this issue on Oct 30, 2014:
…apshot fails. When a snapshot fails to save, Server.pendingSnapshot is not set to nil (as it is after a snapshot is saved successfully). This causes all future calls to TakeSnapshot() to fail with the error "Snapshot: Last snapshot is not finished." Fixes etcd-io#1496
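The failure mode described in that commit message can be illustrated with a small model. This is a simplified JavaScript sketch, not the goraft source: the `SnapshotServer` class, the `save` callback, the `resetOnFailure` switch, and the "too many open files" error used to simulate a failed save are all assumptions made for illustration. Only the `pendingSnapshot` field, the `TakeSnapshot()` name, and the "Last snapshot is not finished" message come from the thread.

```javascript
'use strict';
// Simplified model of the stuck-snapshot state; NOT the goraft implementation.

class SnapshotServer {
  constructor(save, resetOnFailure) {
    this.save = save;                     // hypothetical persistence callback; throws on failure
    this.resetOnFailure = resetOnFailure; // hypothetical switch: true models the proposed fix
    this.pendingSnapshot = null;
  }

  takeSnapshot(index) {
    // A leftover pending snapshot blocks every later attempt.
    if (this.pendingSnapshot !== null) {
      throw new Error('Snapshot: Last snapshot is not finished.');
    }
    this.pendingSnapshot = { index };
    try {
      this.save(this.pendingSnapshot);
    } catch (err) {
      // Without this reset, one failed save leaves the server stuck.
      if (this.resetOnFailure) this.pendingSnapshot = null;
      throw err;
    }
    this.pendingSnapshot = null; // cleared after a successful save
  }
}

// Runs two snapshot attempts where only the first save fails.
// Returns 'ok' or the error message for each attempt.
function runAttempts(resetOnFailure) {
  let calls = 0;
  const save = () => {
    calls += 1;
    if (calls === 1) throw new Error('too many open files'); // simulated failure
  };
  const server = new SnapshotServer(save, resetOnFailure);
  const results = [];
  for (const index of [351, 363]) {
    try {
      server.takeSnapshot(index);
      results.push('ok');
    } catch (err) {
      results.push(err.message);
    }
  }
  return results;
}

console.log(runAttempts(false)); // second attempt is blocked by the stale pending snapshot
console.log(runAttempts(true));  // second attempt succeeds once the failure path resets state
```

In this model, the only difference between the stuck and the recovering behaviour is whether the failure path clears `pendingSnapshot`, which matches the one-line nature of the fix the commit message describes.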
This was referenced Oct 30, 2014
I believe I've found the issue in the raft code. See goraft/raft#242
@jstamerj We are moving to our own raft implementation at etcd/raft. This should be solved in 0.5 release.
I've seen this on OS X and Linux with etcd v0.4.6.
When etcd hits the open-file limit (ulimit -n) while it is snapshotting, it gets into a state where all future snapshots fail until the etcd process is killed and restarted.
Steps to repro:
jeff@zen:~/etcd$ ulimit -n 50
jeff@zen:~/etcd$ ./etcd -snapshot=true -snapshot-count 10
* I used node / javascript to issue 50 concurrent wait requests to etcd. See below for the code.
* When etcd hits the limit, I see something like this in the etcd log to stdout:
2014/10/29 14:36:38 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 5ms
2014/10/29 14:36:38 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 10ms
2014/10/29 14:36:38 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 20ms
2014/10/29 14:36:38 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 40ms
2014/10/29 14:36:39 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 80ms
2014/10/29 14:36:39 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 160ms
2014/10/29 14:36:39 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 320ms
[etcd] Oct 29 14:36:43.874 INFO | zen.local: snapshot of 12 events at index 351 completed
[etcd] Oct 29 14:36:49.876 INFO | zen.local: snapshot of 12 events at index 363 attempted and failed: Snapshot: Last snapshot is not finished.
Expected: After many connections are opened and closed (briefly hitting the maxFiles limit), subsequent snapshots should succeed.
Here is some node/javascript code I used to open 50 simultaneous connections to etcd:
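The reporter's snippet is not reproduced in this copy of the issue. As a stand-in, a minimal Node.js sketch of such a load generator might look like the following. It is not the original script: it assumes the etcd v2 keys API (`GET /v2/keys/<key>?wait=true`) on the client port 4001 seen in the log above, and the key name `foo`, the `ETCD_URL` environment variable, and both function names are hypothetical.

```javascript
'use strict';
// Sketch only: NOT the reporter's original script.
const http = require('http');

// Build `count` long-poll URLs against the etcd v2 keys API.
// The key name is a hypothetical placeholder.
function buildWaitUrls(baseUrl, key, count) {
  const urls = [];
  for (let i = 0; i < count; i++) {
    urls.push(`${baseUrl}/v2/keys/${key}?wait=true`);
  }
  return urls;
}

// Open one socket per URL and hold it until etcd answers or the request errors.
function openWaiters(urls) {
  return urls.map((url) => {
    // agent: false gives every request its own connection (no pooling).
    const req = http.get(url, { agent: false }, (res) => {
      res.resume(); // drain the body so the socket can close
    });
    req.on('error', (err) => console.error(`${url}: ${err.message}`));
    return req;
  });
}

// Only runs when ETCD_URL (a hypothetical variable) is set, for example:
//   ETCD_URL=http://127.0.0.1:4001 node waiters.js
if (process.env.ETCD_URL) {
  openWaiters(buildWaitUrls(process.env.ETCD_URL, 'foo', 50));
}
```

Each wait request stays open until the watched key changes, so 50 of them hold 50 file descriptors on the etcd side, which is enough to reach the `ulimit -n 50` set in the repro steps.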
-- Jeff