Fault injection: get() incorrectly returns not_found #107

slfritchie · 2014-03-27T09:57:45Z

Fault injection parameters:

ftruncate(2): 7% chance (triggered once)
pread(2): 7% chance (triggered once)
other intercepted system calls: also 7% chance, but they are triggered 0 times in this test case.

Failing test case:

5> C1.
[{[{set,{var,1},
        {call,generic_qc_fsm,set_keys,
              [[<<9,217,162,128,149,197,176,236,10,163,247,68,141>>]]}},
   {set,{var,2},
        {call,generic_qc_fsm,open,
              ["/tmp/generic.qc",
               [read_write,
                {open_timeout,0},
                {max_file_size,100},
                {sync_strategy,none}]]}},
   {set,{var,15},
        {call,generic_qc_fsm,put,[{var,2},<<"k">>,<<>>]}},
   {set,{var,17},
        {call,generic_qc_fsm,put,
              [{var,2},
               <<9,217,162,128,149,197,176,236,10,163,247,68,141>>,
               <<>>]}},
   {set,{var,27},{call,generic_qc_fsm,close,[{var,2}]}},
   {set,{var,28},
        {call,generic_qc_fsm,open,
              ["/tmp/generic.qc",
               [read_write,
                {open_timeout,0},
                {max_file_size,100},
                {sync_strategy,none}]]}},
   {set,{var,29},{call,generic_qc_fsm,get,[{var,28},<<"k">>]}},
   {set,{var,30},
        {call,generic_qc_fsm,put,[{var,28},<<"k">>,<<>>]}},
   {set,{var,31},{call,generic_qc_fsm,close,[{var,28}]}},
   {set,{var,32},
        {call,generic_qc_fsm,open,
              ["/tmp/generic.qc",
               [read_write,
                {open_timeout,0},
                {max_file_size,100},
                {sync_strategy,none}]]}},
   {set,{var,33},
        {call,generic_qc_fsm,delete,[{var,32},<<"k">>]}},
   {set,{var,34},
        {call,generic_qc_fsm,get,
              [{var,32},<<9,217,162,128,149,197,...>>]}}],
  7354},
 [{verify_trace,[]}]]

The successful fault injection triggers are:

An ftruncate(2) call returns -1, errno ENOSPC
A pread(2) call returns -1, errno EIO

Test case trace of results:

Trace: [{set_keys,[<<9,217,162,128,149,197,176,236,10,163,247,68,141>>]},
        open,
        {put,yes,<<"k">>,<<>>},
        {put,yes,<<9,217,162,128,149,197,176,236,10,163,247,68,141>>,<<>>},
        close,open,
        {get,<<"k">>,<<>>},
        {put,yes,<<"k">>,<<>>},
        close,open,
        {delete,yes,<<"k">>},
        {get,<<9,217,162,128,149,197,176,236,10,163,247,68,141>>,not_found},
        close]
verify_trace: {get,expected,[<<>>],got,not_found}
verify_trace: failed

All of the DB mutations have status yes, meaning that the eleveldb API responded affirmatively to each call: i.e. there were no errors returned by the API.

This is probably not a truly minimal failing test case: the nature how the random number generator is used means that a certain number of commands needed to happen "in the middle" before a lucky 7% event happened at a crucial time. The two close & open cycles in the middle of the test case: they might be truly necessary, but they might not.

To reproduce, cut-and-paste to execute these commands. Tested with OS X. It ought to work with Linux, but honestly I haven't tested the faulterl library very much on Linux.

mkdir /tmp/e
cd /tmp/e
git clone [email protected]:basho/eleveldb.git
cd eleveldb
git checkout -f 391b885abb2005e4b3f4f96a16822c9b1f31601f
make ; rebar skip_deps=true eunit suites=NONE ; erlc -o deps/faulterl/ebin -I deps/faulterl/include priv/scenario/*erl ; deps/faulterl/ebin/make_intercept_c.escript trigger_commonpaths yo ; env `deps/faulterl/ebin/example_environment.sh $PWD/yo` erl -pz .eunit deps/*/ebin

You now have an Erlang shell running with a fault-injectable VM. Cut-and-paste to execute the following three commands.

C1 = [{[{set,{var,1},{call,generic_qc_fsm,set_keys,[[<<9,217,162,128,149,197,176,236,10,163,247,68,141>>]]}},{set,{var,2},{call,generic_qc_fsm,open,[[47,116,109,112,47,103,101,110,101,114,105,99,46,113,99],[read_write,{open_timeout,0},{max_file_size,100},{sync_strategy,none}]]}},{set,{var,15},{call,generic_qc_fsm,put,[{var,2},<<107>>,<<>>]}},{set,{var,17},{call,generic_qc_fsm,put,[{var,2},<<9,217,162,128,149,197,176,236,10,163,247,68,141>>,<<>>]}},{set,{var,27},{call,generic_qc_fsm,close,[{var,2}]}},{set,{var,28},{call,generic_qc_fsm,open,[[47,116,109,112,47,103,101,110,101,114,105,99,46,113,99],[read_write,{open_timeout,0},{max_file_size,100},{sync_strategy,none}]]}},{set,{var,29},{call,generic_qc_fsm,get,[{var,28},<<107>>]}},{set,{var,30},{call,generic_qc_fsm,put,[{var,28},<<107>>,<<>>]}},{set,{var,31},{call,generic_qc_fsm,close,[{var,28}]}},{set,{var,32},{call,generic_qc_fsm,open,[[47,116,109,112,47,103,101,110,101,114,105,99,46,113,99],[read_write,{open_timeout,0},{max_file_size,100},{sync_strategy,none}]]}},{set,{var,33},{call,generic_qc_fsm,delete,[{var,32},<<107>>]}},{set,{var,34},{call,generic_qc_fsm,get,[{var,32},<<9,217,162,128,149,197,176,236,10,163,247,68,141>>]}}],7354},[{verify_trace,[]}]].
eqc:check(eqc:testing_time(30, generic_qc_fsm:prop(false, false)), C1).
eqc:check(eqc:testing_time(30, generic_qc_fsm:prop(true, false)), C1).

The text was updated successfully, but these errors were encountered:

slfritchie · 2014-03-27T10:04:00Z

This one is a bit more interesting. It still has a close & open cycle in the middle (as well as a failed open attempt), but it only uses a single key.

C3 = [{[{set,{var,1},{call,generic_qc_fsm,set_keys,[[]]}},{set,{var,2},{call,generic_qc_fsm,open,[[47,116,109,112,47,103,101,110,101,114,105,99,46,113,99],[read_write,{open_timeout,0},{max_file_size,100},{sync_strategy,none}]]}},{set,{var,7},{call,generic_qc_fsm,put,[{var,2},<<107>>,<<>>]}},{set,{var,9},{call,generic_qc_fsm,close,[{var,2}]}},{set,{var,10},{call,generic_qc_fsm,open,[[47,116,109,112,47,103,101,110,101,114,105,99,46,113,99],[read_write,{open_timeout,0},{max_file_size,100},{sync_strategy,none}]]}},{set,{var,15},{call,generic_qc_fsm,close,[{var,10}]}},{set,{var,16},{call,generic_qc_fsm,open,[[47,116,109,112,47,103,101,110,101,114,105,99,46,113,99],[read_write,{open_timeout,0},{max_file_size,100},{sync_strategy,none}]]}},{set,{var,40},{call,generic_qc_fsm,close,[{var,16}]}},{set,{var,41},{call,generic_qc_fsm,open,[[47,116,109,112,47,103,101,110,101,114,105,99,46,113,99],[read_write,{open_timeout,0},{max_file_size,100},{sync_strategy,none}]]}},{set,{var,42},{call,generic_qc_fsm,get,[{var,41},<<107>>]}}],8318},[{verify_trace,[]}]].
eqc:check(eqc:testing_time(30, generic_qc_fsm:prop(true, false)), C3).

The diagnostic output:

Trace: [{set_keys,[]},
        open,
        {put,yes,<<"k">>,<<>>},
        close,open,close,
        {open,{error,{db_open,"IO error: lock /tmp/generic.qc/LOCK: Resource deadlock avoided"}}},
        open,
        {get,<<"k">>,not_found},
        close]
verify_trace: {get,expected,[<<>>],got,not_found}
verify_trace: failed

slfritchie · 2014-03-27T10:11:35Z

I've seen over a dozen variations of this failure. (I have yet to have a 5 minute run succeed.) Each failing case has a close-and-reopen cycle in the middle.

matthewvon · 2014-03-28T15:49:11Z

If you are failing reads, that seems to make sense that <<"k">> is sometimes not found. What am I missing?

slfritchie · 2014-03-28T16:03:49Z

The trace shows the order of operations: open, put <<"k">>, close, open, close, open (fails), open, get <<"k">> -> not_found. That last operation is definitely bogus. The fault injection isn't causing pread(2) to return invalid data: the fault injection is causing a return of -1 + valid errno value. I don't believe that eleveldb should be returning {ok, ...} in such a case.

slfritchie · 2014-04-03T16:27:47Z

Hi. I have another case. I'm not sure if it's different or not. The case frequently passes, which is annoying. But it might have something to do with compaction, so I'm wondering if I might open another ticket -- if I run the test case 25 times it a row, it will always fail before the end of the 25th run.

{[{set,{var,1},
       {call,generic_qc_fsm,set_keys,
             [[<<116,116,0,0,0,0>>],{var,parameter_test_dir}]}},
  {set,{var,2},
       {call,generic_qc_fsm,open,
             [{var,parameter_test_dir},
              [read_write,
               {open_timeout,0},
               {max_file_size,100},
               {sync_strategy,none},
               {create_if_missing,true},
               {limited_developer_mem,true},
               {compression,false},
               {write_buffer_size,16000000}]]}},
  {set,{var,6},
       {call,generic_qc_fsm,put,[{var,2},<<116,116,0,0,0,0>>,<<"Eà">>]}},
  {set,{var,8},
       {call,generic_qc_fsm,put_filler,[{var,2},{3085,<<137,112>>},50170]}},
  {set,{var,15},{call,generic_qc_fsm,get,[{var,2},<<116,116,0,0,0,0>>]}}],
 95149}
Trace: [{set_keys,[<<116,116,0,0,0,0>>]},
        open,
        {put,yes,<<116,116,0,0,0,0>>,<<"Eà">>},
        {get,get,<<116,116,0,0,0,0>>,not_found},
        close]
verify_trace: {get,get,<<116,116,0,0,0,0>>,expected,[<<"Eà">>],got,not_found}

The put_filler step is writing 3,085 keys with a prefix of <<137,112>> and value blobs of 50,170 bytes each (for about approximately 150MBytes of data).

slfritchie · 2014-04-04T06:40:51Z

It's also worth noting that the case that I mention from approx. 14 hours ago does not have close & reopen steps in the middle.

slfritchie · 2014-04-05T06:57:24Z

I don't know if there's one or more separate bug/trigger/thingies in here, but FWIW I observed a failure last night where get didn't find a key but fold did. Here's an abbreviated trace of events.

Trace: [{open,ok}, 
        ...
        {close,ok},
        {open,ok},
        {put,yes,<<228,240,133,109,0,0,0,0>>,<<"\ey½K-">>}, 
        {get,fold,<<228,240,133,109,0,0,0,0>>,<<"\ey½K-">>},
        ...
        {get,fold,<<228,240,133,109,0,0,0,0>>,<<"\ey½K-">>},
        ...
        {get,get,<<228,240,133,109,0,0,0,0>>,<<"\ey½K-">>}, 
        ...
        {get,fold,<<228,240,133,109,0,0,0,0>>,<<"\ey½K-">>},
        {get,get,<<228,240,133,109,0,0,0,0>>,not_found},
        ...
        {get,fold,<<228,240,133,109,0,0,0,0>>,<<"\ey½K-">>},
        ...
        {delete,yes,<<228,240,133,109,0,0,0,0>>},
        ...
        {close,ok}]
verify_trace: {get,get,
                   <<228,240,133,109,0,0,0,0>>,
                   expected,
                   [<<"\ey½K-">>],
                   got,not_found}
final_close_ok: true  
postconditions: passed
verify_trace: failed

slfritchie · 2014-04-10T04:48:54Z

Here's more background info on the original failing test case.

I commented out the really_delete_dir() call on line 49 of test/generic_qc_fsm.erl to preserve the data dir after the failing test runs and fails. I've put a tarball of the /tmp/generic_qc data dir and put it at http://www.snookles.com/scotttmp/generic.qc.tar.gz

The failing test case triggers one message to syslogd:

Apr 10 13:44:05 sbb3.local beam.smp[31744]: Close ftruncate failed [28, No space left on device]

matthewvon · 2014-04-10T12:46:57Z

@slfritchie My theory of the moment is that this is "functions as designed". Set "paranoid_checks" to true in eleveldb … likely advanced.config in your Riak 2.0 environment. Instead of the key being not found, the open operation will abort … forever. If this expected behavior is confirmed, then we do not want to change the code.

slfritchie · 2014-04-11T05:31:39Z

Hrm, well, there's no Riak involved in this case.

IMHO a series of API-says-it-was-successful operations followed by another API-says-it-was-successful operation that doesn't return the expected value is a data loss bug. If paranoid_checks=true makes the problem disappear, then IMHO it should be impossible for Riak to run when paranoid_checks=false.

slfritchie · 2014-04-11T07:09:26Z

Current status: still seeing not_found for keys that indeed exist, even when paranoid_checks=true.

eleveldb commit: 35f681e of branch slf-faulterl-introduction3 (which is a rebased branch from slf-faulterl-introduction2), which is in turn based upon today's develop branch at 9ab8ed6
leveldb commit: f4cf10d2ba1b999f10805ac35f22145e80932544 of today's develop branch
faulterl commit: 88bd8c8b377f8cdae9683218f229706f4f79ea8e of today's master branch

To recreate, in two cut-and-paste steps. Tested on OS X 10.8.5 and Ubuntu 12.04.2 LTS (GNU/Linux 3.2.0-54-generic x86_64). First:

mkdir /tmp/e
cd /tmp/e
git clone [email protected]:basho/eleveldb.git
cd eleveldb
git checkout -f 35f681e6e936a7f036e661b24fc73e7fe6f3e18f
./rebar get-deps ; (cd deps/faulterl ; git checkout -f 88bd8c8b377f8cdae9683218f229706f4f79ea8e) ; make ; rm priv/*.so ; (cd c_src/leveldb ; git checkout -f f4cf10d2ba1b999f10805ac35f22145e80932544) ; make ; ./rebar skip_deps=true eunit suites=NONE ; erlc -o deps/faulterl/ebin -I deps/faulterl/include priv/scenario/*erl ; deps/faulterl/ebin/make_intercept_c.escript trigger_commonpaths yo ; env `deps/faulterl/ebin/example_environment.sh $PWD/yo` erl -pz .eunit deps/*/ebin

And second. This one fails for OS X (Ubu case part 2 is below):

C4 = [{[{set,{var,1},{call,generic_qc_fsm,set_keys,[[<<173,0,0,0,0>>],{var,parameter_test_dir}]}},{set,{var,2},{call,generic_qc_fsm,open,[{var,parameter_test_dir},[read_write,{paranoid_checks,true},{open_timeout,0},{max_file_size,100},{sync_strategy,none},{create_if_missing,true},{limited_developer_mem,true},{compression,false},{write_buffer_size,16000000}]]}},{set,{var,3},{call,generic_qc_fsm,put,[{var,2},<<173,0,0,0,0>>,<<>>]}},{set,{var,4},{call,generic_qc_fsm,put_filler,[{var,2},{2245,<<22>>},54214]}},{set,{var,6},{call,generic_qc_fsm,fold_all,[{var,2}]}},{set,{var,7},{call,generic_qc_fsm,close,[{var,2}]}},{set,{var,8},{call,generic_qc_fsm,open,[{var,parameter_test_dir},[read_write,{paranoid_checks,true},{open_timeout,0},{max_file_size,100},{sync_strategy,none},{create_if_missing,true},{limited_developer_mem,true},{compression,false},{write_buffer_size,16000000}]]}},{set,{var,9},{call,generic_qc_fsm,get,[{var,8},<<173,0,0,0,0>>]}},{set,{var,10},{call,generic_qc_fsm,fold_all,[{var,8}]}}],1023},[{verify_trace,[]}]].
eqc:check(eqc:testing_time(30, generic_qc_fsm:prop(false, false)), C4).
[true = eqc:check(eqc:testing_time(30, generic_qc_fsm:prop(true, false)), C4) || _ <- lists:seq(1,30)].

The Ubuntu 12 Linux part 2 is:

C4c = [{[{set,{var,1},{call,generic_qc_fsm,set_keys,[[<<173,0,0,0,0>>],{var,parameter_test_dir}]}},{set,{var,2},{call,generic_qc_fsm,open,[{var,parameter_test_dir},[read_write,{paranoid_checks,true},{open_timeout,0},{max_file_size,100},{sync_strategy,none},{create_if_missing,true},{limited_developer_mem,true},{compression,false},{write_buffer_size,16000000}]]}},{set,{var,3},{call,generic_qc_fsm,put,[{var,2},<<173,0,0,0,0>>,<<>>]}},{set,{var,4},{call,generic_qc_fsm,put_filler,[{var,2},{2245,<<22>>},54214]}},{set,{var,6},{call,generic_qc_fsm,fold_all,[{var,2}]}},{set,{var,7},{call,generic_qc_fsm,close,[{var,2}]}},{set,{var,8},{call,generic_qc_fsm,open,[{var,parameter_test_dir},[read_write,{paranoid_checks,true},{open_timeout,0},{max_file_size,100},{sync_strategy,none},{create_if_missing,true},{limited_developer_mem,true},{compression,false},{write_buffer_size,16000000}]]}},{set,{var,9},{call,generic_qc_fsm,get,[{var,8},<<173,0,0,0,0>>]}},{set,{var,10},{call,generic_qc_fsm,fold_all,[{var,8}]}}],9023},[{verify_trace,[]}]].
eqc:check(eqc:testing_time(30, generic_qc_fsm:prop(false, false)), C4c).
[true = eqc:check(eqc:testing_time(30, generic_qc_fsm:prop(true, false)), C4c) || _ <- lists:seq(1,30)].

Here's an easier-to-read-for-humans formatting of the test case and failing diagnostic output:

{[{set,{var,1},
       {call,generic_qc_fsm,set_keys,
             [[<<173,0,0,0,0>>],{var,parameter_test_dir}]}},
  {set,{var,2},
       {call,generic_qc_fsm,open,
             [{var,parameter_test_dir},
              [read_write,
               {paranoid_checks,true},
               {open_timeout,0},
               {max_file_size,100},
               {sync_strategy,none},
               {create_if_missing,true},
               {limited_developer_mem,true},
               {compression,false},
               {write_buffer_size,16000000}]]}},
  {set,{var,3},{call,generic_qc_fsm,put,[{var,2},<<173,0,0,0,0>>,<<>>]}},
  {set,{var,4},{call,generic_qc_fsm,put_filler,[{var,2},{2245,<<22>>},54214]}},
  {set,{var,6},{call,generic_qc_fsm,fold_all,[{var,2}]}},
  {set,{var,7},{call,generic_qc_fsm,close,[{var,2}]}},
  {set,{var,8},
       {call,generic_qc_fsm,open,
             [{var,parameter_test_dir},
              [read_write,
               {paranoid_checks,true},
               {open_timeout,0},
               {max_file_size,100},
               {sync_strategy,none},
               {create_if_missing,true},
               {limited_developer_mem,true},
               {compression,false},
               {write_buffer_size,16000000}]]}},
  {set,{var,9},{call,generic_qc_fsm,get,[{var,8},<<173,0,0,0,0>>]}},
  {set,{var,10},{call,generic_qc_fsm,fold_all,[{var,8}]}}],
 1023}
Trace: [{open,ok},
        {put,yes,<<173,0,0,0,0>>,<<>>},
        {get,fold,<<173,0,0,0,0>>,<<>>},
        {close,ok},
        {open,ok},
        {get,get,<<173,0,0,0,0>>,not_found},
        {close,ok}]
verify_trace: {get,get,<<173,0,0,0,0>>,expected,[<<>>],got,not_found}
final_close_ok: true
verify_trace: failed

matthewvon · 2014-04-11T12:46:27Z

I still have neither built a leveldb unit test reproducer nor taken time to set up eqc.

That said, the paranoid_check=true and the syslog message point to Recover() log processing eating the error. Specifically the failed ftruncate call did not send an error return. Also, there is the possibility that even if the error was returned that it would be overwritten. Branch mv-tuning7 in basho/leveldb contains the changes (and a throttle change in version_set.cc that is unrelated).

slfritchie · 2014-04-11T14:15:34Z

Branch mv-tuning7 in basho/leveldb contains the changes [...]

I shouldn't be asking until tomorrow morning, but I'll ask now anyway ... why isn't mv-tuning7 merged to develop yet?

matthewvon · 2014-04-11T14:17:51Z

No code review. And there may be other minor bugs to fix. "tuning" branches are just that, minor stuff.

slfritchie · 2014-04-15T06:09:24Z

@matthewvon Nop, sorry, mv-tuning7 doesn't fix the bug: I still see test case C4 failing on OS X within 30 attempts, and typically within 3 attempts.

slfritchie · 2014-04-16T10:31:16Z

Update: Source drift/version control is a pain, but I think I put all of the moving parts in the proper place. Today's mv-tuning7 branches from both eleveldb and leveldb repos appear to fix the C1 bug above: it passes 5K iterations. The original failure case always/almost always fails the first time.

cmeiklejohn · 2014-06-03T17:15:47Z

Moving to 2.0.1.

slfritchie added the Bug label Mar 27, 2014

slfritchie added this to the 2.0-RC milestone Mar 27, 2014

slfritchie mentioned this issue Apr 16, 2014

Fault injection: get() incorrectly returns not_found (part II) #112

Open

slfritchie added the resilience label Apr 17, 2014

slfritchie added the test label May 1, 2014

cmeiklejohn modified the milestones: 2.0.1, 2.0-RC Jun 3, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fault injection: get() incorrectly returns not_found #107

Fault injection: get() incorrectly returns not_found #107

slfritchie commented Mar 27, 2014

slfritchie commented Mar 27, 2014

slfritchie commented Mar 27, 2014

matthewvon commented Mar 28, 2014

slfritchie commented Mar 28, 2014

slfritchie commented Apr 3, 2014

slfritchie commented Apr 4, 2014

slfritchie commented Apr 5, 2014

slfritchie commented Apr 10, 2014

matthewvon commented Apr 10, 2014

slfritchie commented Apr 11, 2014

slfritchie commented Apr 11, 2014

matthewvon commented Apr 11, 2014

slfritchie commented Apr 11, 2014

matthewvon commented Apr 11, 2014

slfritchie commented Apr 15, 2014

slfritchie commented Apr 16, 2014

cmeiklejohn commented Jun 3, 2014

Fault injection: get() incorrectly returns not_found #107

Fault injection: get() incorrectly returns not_found #107

Comments

slfritchie commented Mar 27, 2014

slfritchie commented Mar 27, 2014

slfritchie commented Mar 27, 2014

matthewvon commented Mar 28, 2014

slfritchie commented Mar 28, 2014

slfritchie commented Apr 3, 2014

slfritchie commented Apr 4, 2014

slfritchie commented Apr 5, 2014

slfritchie commented Apr 10, 2014

matthewvon commented Apr 10, 2014

slfritchie commented Apr 11, 2014

slfritchie commented Apr 11, 2014

matthewvon commented Apr 11, 2014

slfritchie commented Apr 11, 2014

matthewvon commented Apr 11, 2014

slfritchie commented Apr 15, 2014

slfritchie commented Apr 16, 2014

cmeiklejohn commented Jun 3, 2014