
etcd gets into bad state -- snapshots fail until etcd process is restarted #1496

Closed
jstamerj opened this issue Oct 29, 2014 · 3 comments

@jstamerj

I've seen this on OS X and Linux with etcd v0.4.6.

When etcd hits the max open files ulimit (ulimit -n) while it is snapshotting, it gets into a state where all future snapshots fail until the etcd process is killed and restarted.

Steps to repro:

  1. To make this easier and faster to repro, set the max open files ulimit to something small:
    jeff@zen:~/etcd$ ulimit -n 50
  2. Start etcd. To make this easier and faster to repro, set -snapshot-count to something small
    jeff@zen:~/etcd$ ./etcd -snapshot=true -snapshot-count 10
  3. Issue requests/open connections to etcd until etcd runs into the maxfiles ulimit
    * I used node / javascript to issue 50 concurrent wait requests to etcd. See below for the code.
    * When etcd hits the limit, I see something like this in the etcd log to stdout:
    2014/10/29 14:36:38 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 5ms
    2014/10/29 14:36:38 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 10ms
    2014/10/29 14:36:38 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 20ms
    2014/10/29 14:36:38 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 40ms
    2014/10/29 14:36:39 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 80ms
    2014/10/29 14:36:39 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 160ms
    2014/10/29 14:36:39 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 320ms
  4. Leave the connections open until etcd does the next snapshot. In the etcd log, I see something like this:
    [etcd] Oct 29 14:36:43.874 INFO | zen.local: snapshot of 12 events at index 351 completed
  5. Cancel / Close the 50 open connections to etcd.
  6. Watch the etcd log. All attempted snapshots fail. I see log entries like:
    [etcd] Oct 29 14:36:49.876 INFO | zen.local: snapshot of 12 events at index 363 attempted and failed: Snapshot: Last snapshot is not finished.
  7. Even days later, all attempted snapshots continue to fail. The memory usage of etcd steadily grows after this.
  8. After killing/restarting etcd, snapshots begin to succeed again.

Expected: After opening and closing many connections (briefly hitting the maxFiles limit), snapshots should be able to succeed.
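The stuck state in steps 6-8 is consistent with a snapshot-in-progress guard that is never reset after a failed save. A minimal Go sketch of that failure mode, with hypothetical names (server, snapshot, takeSnapshot) standing in for the actual raft code:

```go
package main

import (
	"errors"
	"fmt"
)

// snapshot is a placeholder for an in-progress snapshot record.
type snapshot struct{ index int }

// server holds the guard that, once set and never cleared, blocks
// every subsequent snapshot attempt.
type server struct {
	pendingSnapshot *snapshot
}

func (s *server) takeSnapshot(index int, saveFails bool) error {
	// Guard: refuse to start a snapshot while one is pending.
	if s.pendingSnapshot != nil {
		return errors.New("Snapshot: Last snapshot is not finished.")
	}
	s.pendingSnapshot = &snapshot{index: index}
	if saveFails {
		// Failure path: pendingSnapshot is left set, so the guard
		// above rejects every future call until the process restarts.
		return errors.New("save snapshot failed: too many open files")
	}
	s.pendingSnapshot = nil // success path clears the guard
	return nil
}

func main() {
	s := &server{}
	fmt.Println(s.takeSnapshot(351, true))  // fails while the ulimit is hit
	fmt.Println(s.takeSnapshot(363, false)) // fails forever afterwards
	fmt.Println(s.takeSnapshot(375, false)) // still failing
}
```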

Here is some node/javascript code I used to open 50 simultaneous connections to etcd:

var http = require('http');
var request = require('request');

http.globalAgent.maxSockets = 10240; // The default of 5 is much too low.

var etcdUrl = 'http://localhost:4001';
var etcdWatchUrl = etcdUrl + '/v2/keys/?wait=true';

var watchRequests = [];

function startWatch() {
    var req = request(etcdWatchUrl, function(err, response, body) {
        console.log('Watch completed', err, body);
    });
    watchRequests.push(req);
}

function startWatches(numWatches) {
    for (var i = 0; i < numWatches; i++) {
        startWatch();
    }
}

startWatches(50);

-- Jeff

@jstamerj jstamerj changed the title etcd gets into bad state and fails to snapshot until after etcd process is restarted etcd gets into bad state -- snapshots fail until etcd process is restarted Oct 29, 2014
@kelseyhightower
Contributor

@jstamerj Thanks for reporting this issue. I'm going to try to reproduce this for the core team, just to quickly validate your findings.

jstamerj pushed a commit to HBOCodeLabs/etcd that referenced this issue Oct 30, 2014
…apshot fails.

When a snapshot fails to save, Server.pendingSnapshot is not set to nil (as it is after a snapshot is successfully saved). This causes all future calls to TakeSnapshot() to fail with the error "Snapshot: Last snapshot is not finished."
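The commit above suggests the fix: clear Server.pendingSnapshot on the failure path as well as the success path, so a failed snapshot can be retried. A hedged Go sketch of that pattern (names other than pendingSnapshot and TakeSnapshot are illustrative, not goraft's actual code):

```go
package main

import (
	"errors"
	"fmt"
)

type snapshot struct{ index int }

type server struct {
	pendingSnapshot *snapshot
}

// saveSnapshot stands in for the write-to-disk step that can fail,
// e.g. when the process is out of file descriptors.
func (s *server) saveSnapshot(fail bool) error {
	if fail {
		return errors.New("too many open files")
	}
	return nil
}

func (s *server) TakeSnapshot(index int, saveFails bool) error {
	if s.pendingSnapshot != nil {
		return errors.New("Snapshot: Last snapshot is not finished.")
	}
	s.pendingSnapshot = &snapshot{index: index}
	if err := s.saveSnapshot(saveFails); err != nil {
		// The fix: reset the guard on failure too, so the next
		// snapshot attempt is not permanently blocked.
		s.pendingSnapshot = nil
		return err
	}
	s.pendingSnapshot = nil
	return nil
}

func main() {
	s := &server{}
	fmt.Println(s.TakeSnapshot(351, true))  // first attempt fails to save
	fmt.Println(s.TakeSnapshot(363, false)) // next attempt can now succeed
}
```

With the guard cleared in the error path, a transient failure (like briefly hitting the ulimit) no longer wedges snapshotting until restart.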

Fixes etcd-io#1496
@jstamerj
Author

I believe I've found the issue in the raft code. See goraft/raft#242

@xiang90
Contributor

xiang90 commented Dec 5, 2014

@jstamerj We are moving to our own raft implementation at etcd/raft. This should be solved in the 0.5 release.
Thanks for reporting!

@xiang90 xiang90 closed this as completed Dec 5, 2014