Snapshot corruption in the state machine tests #558

jorisdral · 2025-01-31T11:08:08Z

#539 introduced a classic property test for detecting corruption in snapshots. #542 fixed a bug in the code that #539 tested. The next step would be to detect corruption in the state machine tests as well. First, I'll describe the current approach to fault testing in the state machine tests, and then how to add snapshot corruption to the state machine tests.

Overview of fault testing in the state machine

At the time of writing this issue (87549dd), the state machine already has the possibility to inject noisy disk faults using fs-sim into the SUT. These faults are noisy because they produce IO exceptions. Even with fault injection, it is not guaranteed that the SUT will fail with an exception. There is non-determinism in the outcome. However, the model is deterministic, which means that the model must decide to fail with an exception, or to succeed. We solve this discrepancy as follows:

Fault injection is explicit: either we inject faults into the SUT, or we do not. This is represented by a Maybe Errors in relevant state machine actions.
If an action has faults to inject, the model assumes the SUT will throw an exception, and so the model will also throw an exception.

lsm-tree/test/Database/LSMTree/Model/Session.hs

Lines 202 to 237 in 87549dd

    
           -- | @'runModelMWithInjectedErrors' merrs onNoErrors onErrors@ runs the model 
        
           -- with injected disk faults @merrs@. 
        
           -- 
        
           -- The model may run different actions based on whether there are injections or 
        
           -- not. 'onNoErrors' runs in case @merrs == Nothing@, and 'onErrors@ runs in 
        
           -- case @merrs == Just _@. 
        
           -- 
        
           -- Typically, @onNoErrors@ will be a model function like 'lookups', and 
        
           -- @onErrors@ should be the identity state transition. This models the approach 
        
           -- of the @lsm-tree@ library: the *logical* state of the database should remain 
        
           -- unchanged if there were any disk faults. 
        
           -- 
        
           -- The model functions in the remainder of the module only describe the happy 
        
           -- path with respect to disk faults: they assume there aren't any such faults. 
        
           -- The intent of 'runModelMWithInjectedErrors' is to augment the model with 
        
           -- responses to disk fault injection. 
        
           -- 
        
           -- The real system's is not guaranteed to fail with an error if there are 
        
           -- injected disk faults. However, the approach taken in 
        
           -- 'runModelMWithInjectedErrors' is to *always* throw a modelled disk error in 
        
           -- case there are injections, *even if* the real system happened to not fail 
        
           -- completely. This makes 'runModelMWithInjectedErrors' an approximation of the 
        
           -- real system's behaviour in case of injections. For the model to more 
        
           -- accurately model the system's behaviour, the model would have to know for 
        
           -- each API function precisely which errors are injected in which order, and 
        
           -- whether they are handled internally by the library or not. This is simply 
        
           -- infeasible: the model would become very complex. 
        
           runModelMWithInjectedErrors :: 
        
                Maybe e -- ^ The errors that are injected 
        
             -> ModelM a -- ^ Action to run on 'Nothing' 
        
             -> ModelM () -- ^ Action to run on 'Just' 
        
             -> Model -> (Either Err a, Model) 
        
           runModelMWithInjectedErrors Nothing onNoErrors _ st = 
        
               runModelM onNoErrors st 
        
           runModelMWithInjectedErrors (Just _) _ onErrors st = 
        
               runModelM (onErrors >> throwError (ErrFsError "modelled FsError")) st

If an action has faults to inject, the SUT will run with those injected faults using withErrors from the fs-sim package.

If the SUT threw an exception, then the model and SUT both return an exception so we're okay.
If the SUT did not throw an exception, then it accidentally succeeded in performing the action. The model and SUT will not be in sync, so what we do is that we roll back the successful action. What to roll back differs per action. For example if CreateSnapshot accidentally succeeded, then the roll back action is to delete the snapshot. This makes sure the model and SUT are in sync again, and then we throw a dummy error so that the model and SUT have matching responses to the action.

lsm-tree/test/Test/Database/LSMTree/StateMachine.hs

Lines 1301 to 1339 in 87549dd

    
           -- | @'runRealWithInjectedErrors' _ errsVar merrs action rollback@ runs @action@ 
        
           -- with injected errors if available in @merrs@. 
        
           -- 
        
           -- See 'Model.runModelMWithInjectedErrors': the real system is not guaranteed to 
        
           -- fail with an error if there are injected disk faults, but the model *always* 
        
           -- fails. In case the real system accidentally succeeded in running @action@ 
        
           -- when there were errors to inject, then we roll back the success using 
        
           -- @rollback@. This ensures that the model and system stay in sync. For example: 
        
           -- if creating a snapshot accidentally succeeded, then the rollback action is to 
        
           -- delete that snapshot. 
        
           runRealWithInjectedErrors :: 
        
                (MonadCatch m, MonadSTM m, PrimMonad m) 
        
             => String -- ^ Name of the action 
        
             -> RealEnv h m 
        
             -> Maybe Errors 
        
             -> m t-- ^ Action to run 
        
             -> (t -> m ()) -- ^ Rollback if the action *accidentally* succeeded 
        
             -> m (Either Model.Err t) 
        
           runRealWithInjectedErrors s env merrs k rollback = 
        
             case merrs of 
        
               Nothing -> do 
        
                 modifyMutVar faultsVar (InjectFaultNone s :) 
        
                 catchErr handlers k 
        
               Just errs -> do 
        
                 eith <- catchErr handlers $ FSSim.withErrors errsVar errs k 
        
                 case eith of 
        
                   Left (Model.ErrFsError _) -> do 
        
                     modifyMutVar faultsVar (InjectFaultInducedError s :) 
        
                     pure eith 
        
                   Left _ -> 
        
                     pure eith 
        
                   Right x -> do 
        
                     modifyMutVar faultsVar (InjectFaultAccidentalSuccess s :) 
        
                     rollback x 
        
                     pure $ Left $ Model.ErrFsError ("dummy: " <> s) 
        
             where 
        
               errsVar = envErrors env 
        
               faultsVar = envInjectFaultResults env 
        
               handlers = envHandlers env

Tasks

(feat: add corruptSnapshot to model #555) The model has to be made aware of snapshot corruption, so that it can respond appropriately when opening a snapshot that is corrupt. Moreover, we have to add a function to the model that explicitly corrupts a snapshot.
Add support for corrupting snapshots explicitly in the SUT. There are two options, and I'd probably recommend going with option (i) because it's the most straightforward to do and requires minimal code changes. We can implement option (ii) afterwards, but then we'll already have all the necessary boilerplate in place.
1. Use the flipSnapshotBit approach from Test snapshot corruption #539. The code to corrupt the snapshot can be factored out from prop_flipSnapshotBit, and then we'd probably need to add a corruptSnapshot function to the IsTable class and implement it using flipSnapshotBit.
2. Modify fs-sim to also perform silent corrupted writes in addition to the noisy corrupted writes it does now.
Represent the instruction to corrupt a snapshot in the CreateSnapshot action.
1. Something like Maybe (Double, Double) should be added, which describes which file and bit to corrupt in a snapshot
2. A Maybe Errors with silent corruption instructions.
There is currently already a Maybe Errors in the CreateSnapshot action. We probably do want to keep this field separate, because this existing Maybe Errors injects noisy exceptions. This has a separate goal from the corruption testing: we use it to test that we do not forget references, double release references, leak file handles, etc. Moreover, we use it to test that the internal database state stays consistent.

Then make the appropriate changes to handle updated CreateSnapshot action. Stuff like modifying generators, modifying runModel, modifying runIO and runIOSim, and so forth. Finally, run the tests, make fixes to the testing framework, the model, or the SUT, and so forth.

The text was updated successfully, but these errors were encountered:

wenkokke · 2025-02-07T17:16:04Z

I've opened #566 with my concerns around the use of Double_0_1.

There are some issues that are specific to the current implementation of snapshot corruption. Specifically, flipRandomBitInRandomFile may potentially fail to actually corrupt a file, if the randomness provided via Double_0_1 points to an empty file. This would make it difficult to match an implementation of corruptSnapshot that matches the model, since the model simply flips a boolean that says the snapshot is corrupted. I see two possible solutions:

Using the arbitrary generators approach—see #566—we could simply keep asking for a different file. This would work in most cases, except the case where all files are empty. However, due to the existence of the snapshot metadata and checksums files, I believe there should always be a file to corrupt, so this may not be an issue.
Alternatively, we could choose to corrupt empty files by adding a random bit at the end. This would likewise be easier when using the arbitrary generators approach, since we only require the added bit when corrupting an empty file.

I don't think there is an io-sim compatible notion of random number generation, so either way, the interface offered by IsTable needs to receive the randomness as an argument, whichever way we choose to represent it.

I've implemented the use of StdGen in a3ade6.

wenkokke · 2025-02-07T17:38:21Z

One possible solution might be to lift the tests themselves into the Gen monad using GenT.

jorisdral assigned wenkokke Jan 31, 2025

jorisdral mentioned this issue Jan 31, 2025

QLS: statically enable/disable fault injection depending on the SUT #534

Draft

wenkokke linked a pull request Feb 6, 2025 that will close this issue

feat(issue558): test silent snapshot corruption #564

Draft

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Snapshot corruption in the state machine tests #558

Snapshot corruption in the state machine tests #558

jorisdral commented Jan 31, 2025 •

edited

Loading

wenkokke commented Feb 7, 2025 •

edited

Loading

wenkokke commented Feb 7, 2025 •

edited

Loading

Snapshot corruption in the state machine tests #558

Snapshot corruption in the state machine tests #558

Comments

jorisdral commented Jan 31, 2025 • edited Loading

Overview of fault testing in the state machine

Tasks

wenkokke commented Feb 7, 2025 • edited Loading

wenkokke commented Feb 7, 2025 • edited Loading

jorisdral commented Jan 31, 2025 •

edited

Loading

wenkokke commented Feb 7, 2025 •

edited

Loading

wenkokke commented Feb 7, 2025 •

edited

Loading