Missing log entries when main exits before threads #38
Comments
Detecting the "race" between program exit and message send seems tricky, since the exit could happen during the message send, or during the "transaction begin" you propose, or just before the "transaction begin", etc. I think a cleaner way might be to assert that there is only one thread running at program exit (via an
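Something along these lines is one way that assertion could look; this is only a rough sketch with hypothetical names (`spawn_tracked`, `assert_single_thread_at_exit`), assuming all client spawns can be routed through a wrapper of ours:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Hypothetical counter of live spawned threads; every spawn would go through
// this wrapper, which decrements the count when the thread's closure returns.
static LIVE_THREADS: AtomicUsize = AtomicUsize::new(0);

fn spawn_tracked<F>(f: F) -> thread::JoinHandle<()>
where
    F: FnOnce() + Send + 'static,
{
    LIVE_THREADS.fetch_add(1, Ordering::SeqCst);
    thread::spawn(move || {
        f();
        // Note: a panicking closure would skip this decrement in this sketch.
        LIVE_THREADS.fetch_sub(1, Ordering::SeqCst);
    })
}

// Called at the very end of main: if any tracked thread is still alive,
// the record log may be missing entries.
fn assert_single_thread_at_exit() {
    assert_eq!(
        LIVE_THREADS.load(Ordering::SeqCst),
        0,
        "spawned threads still running at program exit; record log may be incomplete"
    );
}
```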
Right, that makes sense for the panic. I asked the Servo developers and they don't have any sort of shutdown procedure for all threads per se (when shutdown synchronization matters, they handle it for those specific threads). The problem for us is that when the main thread exits we end up with some nondeterminism, as the remaining threads are killed by the OS.
Given what the devs said about Servo's shutdown model, this will probably not work. Fundamentally I'm trying to figure out if there is a way for us to detect these missing entries, as they lead to timeout errors that are very common. If we could filter some out knowing they're "fake" timeout errors, that would be great. I agree the transaction-based approach probably won't work; no matter how I think about it, it is just not atomic enough.
Detecting missing entries seems hard: we don't know whether we would have seen an extra message on the next run. However, if the Servo devs aren't worried about thread exit and missing messages, then we shouldn't be either. Maybe there's some way to identify these "unimportant messages" and exclude them from the log so the log contents are deterministic. That said, I don't really understand how these unimportant/missing messages result in the "timeout errors" you mentioned.
So it may not necessarily be unimportant messages that are lost. Here is a possible trace which I observed with the code above. On record:
I can tell this was the true ordering of events since channels are FIFO and the main thread prints things in this order. When checking the debug logs and the record log, I see that Thread 2 got to write its message to the channel but was unable to write this action to the record log, while the main thread did record the fact that it got a message from Thread 2. On replay:
I think I'm getting confused between the test code you listed above (where I can see the race and the timeout leading to a panic) and what's happening in the Servo tests. I gathered that if the Servo devs aren't worried about shutting down threads cleanly, then there are some "unimportant" messages that perhaps get sent (or not) during that unclean shutdown. But I'm filling in lots of gaps here :-). I guess my question is: is the test code above representative of what's causing the Servo tests to flake out?
Sorry for the confusion. I haven't witnessed this bug in any Servo test. This is purely a bug in our own record-and-replay code. So I'm trying to get all these examples working first, before looking at the harder-to-debug Servo issues.
Got it! I can't think of a clean solution. Maybe just logging first and sending the message second would lead to fewer of these log discrepancies (this is ignoring the core issue, I know). I think having ways to diagnose the root cause of the flakiness would be valuable, but it's hard to know how to do that without debugging a few actual flakes and then crafting an analysis around it.
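A rough sketch of the reordering I mean, using std `mpsc` and placeholder names rather than our actual channel wrapper:

```rust
use std::fs::File;
use std::io::Write;
use std::sync::mpsc::Sender;
use std::sync::Mutex;

// Placeholder send wrapper illustrating "log first, send second".
// The log format and names here are stand-ins, not the real implementation.
fn record_then_send<T>(record_log: &Mutex<File>, sender: &Sender<T>, msg: T, thread_id: u64) {
    {
        // Record the send event before the message becomes visible to the receiver.
        let mut log = record_log.lock().unwrap();
        writeln!(log, "send from thread {}", thread_id).unwrap();
    }
    // If the program exits right here, the log claims a send that never happened;
    // that is the opposite discrepancy, but it may be easier to tolerate on replay
    // than a logged receive with no matching send.
    sender.send(msg).unwrap();
}
```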
Our testing has proven fruitful, as I have seen some interesting bugs via running our integration tests. Currently, I see either a timeout error or messages arriving in the wrong order during replay.
The actual error is missing log entries! A thread sends a message and records the sending event in the record_log; the receiver thread receives the message and records this event in our record_log as well. The first of those recording steps, the sender logging its send, is the one that sometimes never happens.

The problem is atomicity: sending + recording should be atomic, but it isn't. Writing to the log entails acquiring the lock for a File and writing the entry to it, and this happens right after we send a message down the channel. However, it seems possible for the thread to send the message and then get stuck trying to acquire the lock because it is held by someone else (the critical section is small, so I'm surprised this happens). Furthermore, I see other threads that sent their messages at a later point in time acquire the lock before our sender thread (which is within the realm of possibility, as mutexes provide no acquisition-order guarantee based on thread arrival order). Then the program ends and this sender thread never gets to write its entry to the log.
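In other words, the current ordering looks roughly like this (a sketch with placeholder names and std `mpsc` standing in for our actual wrapper, not the real API):

```rust
use std::fs::File;
use std::io::Write;
use std::sync::mpsc::Sender;
use std::sync::Mutex;

// Sketch of the problematic ordering described above.
fn send_then_record<T>(record_log: &Mutex<File>, sender: &Sender<T>, msg: T, thread_id: u64) {
    // 1. The message is now in the channel and visible to the receiver,
    //    which may log its receive immediately.
    sender.send(msg).unwrap();

    // 2. If main exits while we are blocked here (or just before this point),
    //    the OS tears this thread down and the send is never logged.
    let mut log = record_log.lock().unwrap();
    writeln!(log, "send from thread {}", thread_id).unwrap();
}
```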
I'm pretty sure this is what is happening, as adding a join for all spawned threads in the client code fixes the issue. The issue happens rarely, and adding logging output makes it harder to reproduce: with no logging output it happens very often when running the test 30 times in a row, with some logging output it took about 360 executions to show up, and with threads joined it never happened in 1000 executions of the test.
Here is the offending test with thread joining to fix the issue:
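(What follows is only an approximate sketch of that test, using std `mpsc` and placeholder values rather than the exact code.)

```rust
use std::sync::mpsc::channel;
use std::thread;
use std::time::Duration;

// Approximate shape of the test: several threads each send one message and
// main receives with a timeout. The join calls at the end are the fix;
// without them, main can return while a sender is still waiting to append
// its entry to the record log.
fn main() {
    let (tx, rx) = channel();

    let handles: Vec<_> = (0..3)
        .map(|i| {
            let tx = tx.clone();
            thread::spawn(move || {
                tx.send(i).unwrap();
            })
        })
        .collect();

    for _ in 0..3 {
        let msg = rx.recv_timeout(Duration::from_secs(1)).unwrap();
        println!("main received {}", msg);
    }

    // Join every spawned thread before main exits; with these joins the
    // missing-log-entry failure did not reproduce.
    for h in handles {
        h.join().unwrap();
    }
}
```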
I don't think we can assume that Servo or other client code properly handles thread cleanup on program exit, not to mention on a panic! from the main thread... Maybe we can provide this atomicity by using transactions that begin before the message is sent and end when the message is written to the log? We would then detect this failure as transactions that started but did not finish.
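Concretely, the transaction idea might look something like this (hypothetical marker format and placeholder names, not our real API):

```rust
use std::fs::File;
use std::io::Write;
use std::sync::mpsc::Sender;
use std::sync::Mutex;

// Bracket every send with begin/end markers in the log. A `begin` with no
// matching `end` would tell the replayer that a send event was lost to
// program exit, so a later timeout waiting for it could be treated as "fake".
fn send_transactional<T>(
    record_log: &Mutex<File>,
    sender: &Sender<T>,
    msg: T,
    thread_id: u64,
    seq: u64,
) {
    {
        let mut log = record_log.lock().unwrap();
        writeln!(log, "begin send {} {}", thread_id, seq).unwrap();
    }

    sender.send(msg).unwrap();

    // Exit can still land between the send and this write, but then the
    // dangling `begin` at least makes the gap detectable after the fact.
    let mut log = record_log.lock().unwrap();
    writeln!(log, "end send {} {}", thread_id, seq).unwrap();
}
```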
However, I can't reproduce it manually. I'll look more into this.