Missing log entries when main exits before threads #38

Open
gatoWololo opened this issue May 11, 2020 · 7 comments

@gatoWololo
Owner

gatoWololo commented May 11, 2020

Our testing has proven fruitful: running our integration tests has surfaced some interesting bugs. Currently, I see either a timeout error or messages arriving in the wrong order during replay.

The actual error is missing log entries! A thread sends a message and records the sending event in the record_log; the receiver thread receives the message and records this event in the record_log. The first step, recording the sending event, sometimes never happens. The problem is atomicity: sending + recording should be atomic, but it isn't. Writing to the log entails acquiring the lock for a File and writing the entry to it, and this happens right after we send a message down the channel.

However, it seems it is possible for the thread to send the message and then get stuck trying to acquire the lock because it is held by someone else (the critical section is small, so I'm surprised this happens). Furthermore, I see other threads that sent their messages later in time acquire the lock before our sender thread (which is within the realm of possibility, as mutexes provide no acquisition-order guarantee based on thread arrival order). Then the program ends and this sender thread never gets to write its entry to the log.
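Roughly, the problematic pattern looks like the sketch below. This is not the actual rr_channel code; RECORD_LOG and send_and_record are made-up names used only to illustrate the send-then-record window.

use std::fs::File;
use std::io::Write;
use std::sync::mpsc::Sender;
use std::sync::Mutex;

// Hypothetical global record log guarded by a mutex (illustrative only).
static RECORD_LOG: Mutex<Option<File>> = Mutex::new(None);

fn send_and_record<T>(tx: &Sender<T>, msg: T, entry: &str) {
    // 1. The message is sent down the channel and becomes visible to the receiver...
    tx.send(msg).unwrap();

    // 2. ...and only then do we try to record it. If the process exits while this
    //    thread is still waiting for the lock, the send was observed but never logged.
    let mut log = RECORD_LOG.lock().unwrap();
    if let Some(file) = log.as_mut() {
        writeln!(file, "{}", entry).unwrap();
    }
}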

I'm pretty sure this is what is happening, as adding a join for all spawned threads in the client code fixes the issue. The issue happens rarely, and adding logging output makes it harder to reproduce. With no logging output it happens very often when running the test 30 times in a row; with some logging output it took about 360 executions for it to happen. With threads joined, it never happened in 1000 executions of the test.

Here is the offending test with thread joining to fix the issue:

use rr_channel::detthread;
use rr_channel::mpsc;

fn main() {
    let mut handles = Vec::new();

    // Create a shared channel that can be sent along from many threads.
    let (tx, rx) = mpsc::channel();
    for i in 0..10 {
        let tx = tx.clone();
        let h = detthread::spawn(move || {
            tx.send(i).unwrap();
        });
        handles.push(h);
    }

    for _ in 0..10 {
        let j = rx.recv().unwrap();
        println!("Received {}", j);
    }

    for h in handles {
        h.join().unwrap();
    }
}

I don't think we can assume that Servo or other client code properly handles thread cleanup on program exit, can we? Not to mention a panic! from the main thread...

Maybe we can provide this atomicity by using transactions that begin before the message is sent and end when the message is written to the log? We would detect lost sends as transactions that started but did not finish.
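A sketch of what that might look like, assuming we can write arbitrary marker entries to the log (write_entry, transactional_send, and the begin/commit format are all made up for illustration):

use std::sync::mpsc::Sender;

// Assumed helper that appends one entry to the record log under its lock.
fn write_entry(entry: &str) {
    let _ = entry; // ...acquire the log lock and append `entry`...
}

fn transactional_send<T>(tx: &Sender<T>, msg: T, event_id: u64) {
    // A "begin" entry is written before the send, a "commit" entry after the
    // send has been recorded. On replay, a begin with no matching commit tells
    // us the send may have happened without ever being recorded, which is
    // different from the send never having happened at all.
    write_entry(&format!("begin {}", event_id));
    tx.send(msg).unwrap();
    write_entry(&format!("commit {}", event_id));
}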

Currently, I see a timeout error or messages arrive in the wrong order during replay.
Seeing the timeout error makes sense; seeing the messages arrive out of order doesn't.

However, I can't reproduce it manually. I'll look into this more.

@devietti

I think a panic would be covered by our quasi-deterministic cop-out. Maybe we'd get lucky and TiVo would render the underlying cause of the panic deterministic, but there are always lots of reasons things could crash that we can't control (out-of-memory, out-of-disk, ...).

Detecting the "race" between program exit and message send seems tricky, since the exit could happen during the message send, or during the "transaction begin" you propose, or just before the "transaction begin", etc. I think a cleaner way might be to assert that there is only one thread running at program exit (via an atexit handler?).
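One way to sketch that check, assuming we wrap thread spawning ourselves (the counter and function names are hypothetical, and the check could just as well be registered via libc's atexit):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread::{self, JoinHandle};

// Hypothetical count of spawned threads that have not yet finished.
static LIVE_THREADS: AtomicUsize = AtomicUsize::new(0);

fn spawn_counted<F>(f: F) -> JoinHandle<()>
where
    F: FnOnce() + Send + 'static,
{
    LIVE_THREADS.fetch_add(1, Ordering::SeqCst);
    thread::spawn(move || {
        f();
        LIVE_THREADS.fetch_sub(1, Ordering::SeqCst);
    })
}

// Run at program exit (end of main, or an atexit handler): if any spawned
// thread is still live, the record log may be missing entries.
fn assert_clean_shutdown() {
    let live = LIVE_THREADS.load(Ordering::SeqCst);
    assert_eq!(live, 0, "{} thread(s) still running at exit", live);
}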

@gatoWololo
Owner Author

I think a panic would be covered by our quasi-deterministic cop-out

Right, that makes sense for the panic. I asked the Servo developers, and they don't have any sort of shutdown procedure for all threads per se (when shutdown synchronization matters, they handle it for those specific threads). The problem for us is that when the main thread exits, we end up with some nondeterminism as the remaining threads are killed by the OS.

I think a cleaner way might be to assert that there is only one thread running at program exit (via an atexit handler?)

Given what the devs said about Servo's shutdown model, this will probably not work.

Fundamentally, I'm trying to figure out whether there is a way for us to detect these missing entries, as they lead to timeout errors that are very common. If we can filter some of these out, knowing they're "fake" timeout errors, that would be great. I agree the transaction-based approach probably won't work; no matter how I think about it, it is just not atomic enough.

@devietti

Detecting missing entries seems hard; we don't know whether we would have seen an extra message on the next run.

However, if the Servo devs aren't worried about thread exit and missing messages then we shouldn't be either. Maybe there's some way to identify these "unimportant messages" and exclude them from the log so the log contents are deterministic.

However, I don't really understand how these unimportant/missing messages result in the "timeout errors" you mentioned.

@gatoWololo
Owner Author

However, I don't really understand how these unimportant/missing messages result in the "timeout errors" you mentioned.

So it may not necessarily be unimportant messages that are lost. Here is a possible trace which I observed with the code above.

On record:

  1. Thread 1 writes its message to the channel.
  2. Thread 2 writes its message to the channel.
  3. Threads 3-10 write their messages to the channel.
  4. The main thread reads 10 messages from the channel.

I can tell this was the true ordering of events since channels are FIFO and the main thread prints things in this ordering.

Checking the debug logs and the record_log, I see that Thread 2 got to write its message to the channel but was unable to write this action to the record_log, while the main thread did record the fact that it got a message from Thread 2 in the record_log.

On replay:

  1. Thread 1 writes its message to the channel.
  2. Thread 2 sees no entry for itself in the log (a run-off-the-log error). We assume Thread 2 didn't get this far during record, so we put this thread to sleep.
  3. The main thread times out waiting for the message from Thread 2 to arrive => panic! (sketched below)
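A heavily simplified sketch of that receive-side behavior (this is not rr_channel's actual implementation; the timeout value and function name are made up):

use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

// Hypothetical replay-side receive. The real library consults the record log
// to learn which sender's message should arrive next; here we only model the
// timeout that fires when that sender was parked and never sends.
fn replay_recv<T>(rx: &Receiver<T>, expected_sender: u64) -> T {
    match rx.recv_timeout(Duration::from_secs(1)) {
        Ok(msg) => msg,
        Err(RecvTimeoutError::Timeout) => {
            panic!("replay timed out waiting for a message from thread {}", expected_sender)
        }
        Err(RecvTimeoutError::Disconnected) => panic!("channel disconnected during replay"),
    }
}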

@devietti

I think I'm getting confused between the test code you listed above (where I can see the race and the timeout leading to a panic) and what's happening in the Servo tests.

I gathered that if the Servo devs aren't worried about shutting down threads cleanly, then there are some "unimportant" messages that perhaps get sent (or not) during that unclean shutdown. But I'm filling in lots of gaps here :-). I guess my question is: is the test code above representative of what's causing the Servo tests to flake out?

@gatoWololo
Owner Author

Sorry for the confusion. I haven't witnessed this bug in any Servo test; this is purely a bug in our examples/. However, I would bet money that this happens inside Servo too. During my internship I witnessed many Servo tests timing out for unknown reasons. "Unknown" because it is extremely difficult to sift through thousands of log messages and understand why things are failing.

So I'm trying to get all these examples working first, before looking at the harder-to-debug Servo issues.

@devietti

Got it! I can't think of a clean solution. Maybe logging first and sending the message second would lead to fewer of these log discrepancies (this ignores the core issue, I know).
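For what it's worth, a sketch of that reordering, with a made-up write_entry helper standing in for the real log append:

use std::sync::mpsc::Sender;

// Assumed helper that appends one entry to the record log under its lock.
fn write_entry(entry: &str) {
    let _ = entry; // ...acquire the log lock and append `entry`...
}

fn record_then_send<T>(tx: &Sender<T>, msg: T, entry: &str) {
    // Record first, send second: if we die in between, the log has an entry
    // for a message the receiver never saw, rather than the receiver having
    // logged a message whose send was never recorded.
    write_entry(entry);
    tx.send(msg).unwrap();
}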

I think having ways to diagnose the root cause of the flakiness would be valuable, but it's hard to know how to do that without debugging a few actual flakes and then crafting an analysis around it.

gatoWololo mentioned this issue May 14, 2020