
etcd gets into bad state -- snapshots fail until etcd process is restarted #1496

Closed
jstamerj opened this issue Oct 29, 2014 · 3 comments

@jstamerj

I've seen this on OS X and Linux with etcd v0.4.6.

When etcd hits the max open files ulimit (ulimit -n) while it is snapshotting, it gets into a state where all future snapshots fail until the etcd process is killed and restarted.

Steps to repro:

  1. To make this easier and faster to repro, set the max open files ulimit to something small:
    jeff@zen:~/etcd$ ulimit -n 50
  2. Start etcd. To make this easier and faster to repro, set -snapshot-count to something small
    jeff@zen:~/etcd$ ./etcd -snapshot=true -snapshot-count 10
  3. Issue requests/open connections to etcd until etcd runs into the maxfiles ulimit
    * I used node / javascript to issue 50 concurrent wait requests to etcd. See below for the code.
    * When etcd hits the limit, I see something like this in the etcd log to stdout:
    2014/10/29 14:36:38 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 5ms
    2014/10/29 14:36:38 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 10ms
    2014/10/29 14:36:38 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 20ms
    2014/10/29 14:36:38 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 40ms
    2014/10/29 14:36:39 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 80ms
    2014/10/29 14:36:39 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 160ms
    2014/10/29 14:36:39 http: Accept error: accept tcp [::]:4001: too many open files; retrying in 320ms
  4. Leave the connections open until etcd does the next snapshot. In the etcd log, I see something like this:
    [etcd] Oct 29 14:36:43.874 INFO | zen.local: snapshot of 12 events at index 351 completed
  5. Cancel / Close the 50 open connections to etcd.
  6. Watch the etcd log. All attempted snapshots fail. I see log entries like:
    [etcd] Oct 29 14:36:49.876 INFO | zen.local: snapshot of 12 events at index 363 attempted and failed: Snapshot: Last snapshot is not finished.
  7. Even days later, all attempted snapshots continue to fail. The memory usage of etcd steadily grows after this.
  8. After killing/restarting etcd, snapshots begin to succeed again.

Expected: After opening and closing many connections (briefly hitting the maxFiles limit), snapshots should be able to succeed.
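The stuck state in steps 6-8 is consistent with a snapshot-in-progress guard that is never reset after a failed save. A minimal Go sketch of that failure mode, with hypothetical names (server, snapshot, takeSnapshot) standing in for the actual raft code:

```go
package main

import (
	"errors"
	"fmt"
)

// snapshot is a placeholder for an in-progress snapshot record.
type snapshot struct{ index int }

// server holds the guard that, once set and never cleared, blocks
// every subsequent snapshot attempt.
type server struct {
	pendingSnapshot *snapshot
}

func (s *server) takeSnapshot(index int, saveFails bool) error {
	// Guard: refuse to start a snapshot while one is pending.
	if s.pendingSnapshot != nil {
		return errors.New("Snapshot: Last snapshot is not finished.")
	}
	s.pendingSnapshot = &snapshot{index: index}
	if saveFails {
		// Failure path: pendingSnapshot is left set, so the guard
		// above rejects every future call until the process restarts.
		return errors.New("save snapshot failed: too many open files")
	}
	s.pendingSnapshot = nil // success path clears the guard
	return nil
}

func main() {
	s := &server{}
	fmt.Println(s.takeSnapshot(351, true))  // fails while the ulimit is hit
	fmt.Println(s.takeSnapshot(363, false)) // fails forever afterwards
	fmt.Println(s.takeSnapshot(375, false)) // still failing
}
```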

Here is some node/javascript code I used to open 50 simultaneous connections to etcd:

var http = require('http');
var request = require('request');

http.globalAgent.maxSockets = 10240; // The default of 5 is much too low.

var etcdUrl = 'http://localhost:4001';
var etcdWatchUrl = etcdUrl + '/v2/keys/?wait=true';

var watchRequests = [];

function startWatch() {
    var req = request(etcdWatchUrl, function(err, response, body) {
        console.log('Watch completed', err, body);
    });
    watchRequests.push(req);
}

function startWatches(numWatches) {
    for (var i = 0; i < numWatches; i++) {
        startWatch();
    }
}

startWatches(50);

-- Jeff

@jstamerj jstamerj changed the title etcd gets into bad state and fails to snapshot until after etcd process is restarted etcd gets into bad state -- snapshots fail until etcd process is restarted Oct 29, 2014
@kelseyhightower
Contributor

@jstamerj Thanks for reporting this issue. I'm going to try to reproduce this for the core team, just to quickly validate your findings.

jstamerj pushed a commit to HBOCodeLabs/etcd that referenced this issue Oct 30, 2014
…apshot fails.

When a snapshot fails to save, Server.pendingSnapshot is not set to nil (as it is after a snapshot is successfully saved). This causes all future calls to TakeSnapshot() to fail with the error "Snapshot: Last snapshot is not finished."
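The commit above suggests the fix: clear Server.pendingSnapshot on the failure path as well as the success path, so a failed snapshot can be retried. A hedged Go sketch of that pattern (names other than pendingSnapshot and TakeSnapshot are illustrative, not goraft's actual code):

```go
package main

import (
	"errors"
	"fmt"
)

type snapshot struct{ index int }

type server struct {
	pendingSnapshot *snapshot
}

// saveSnapshot stands in for the write-to-disk step that can fail,
// e.g. when the process is out of file descriptors.
func (s *server) saveSnapshot(fail bool) error {
	if fail {
		return errors.New("too many open files")
	}
	return nil
}

func (s *server) TakeSnapshot(index int, saveFails bool) error {
	if s.pendingSnapshot != nil {
		return errors.New("Snapshot: Last snapshot is not finished.")
	}
	s.pendingSnapshot = &snapshot{index: index}
	if err := s.saveSnapshot(saveFails); err != nil {
		// The fix: reset the guard on failure too, so the next
		// snapshot attempt is not permanently blocked.
		s.pendingSnapshot = nil
		return err
	}
	s.pendingSnapshot = nil
	return nil
}

func main() {
	s := &server{}
	fmt.Println(s.TakeSnapshot(351, true))  // first attempt fails to save
	fmt.Println(s.TakeSnapshot(363, false)) // next attempt can now succeed
}
```

With the guard cleared in the error path, a transient failure (like briefly hitting the ulimit) no longer wedges snapshotting until restart.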

Fixes etcd-io#1496
@jstamerj
Author

I believe I've found the issue in the raft code. See goraft/raft#242

@xiang90
Contributor

xiang90 commented Dec 5, 2014

@jstamerj We are moving to our own raft implementation at etcd/raft. This should be solved in the 0.5 release.
Thanks for reporting!

@xiang90 xiang90 closed this as completed Dec 5, 2014