-
Notifications
You must be signed in to change notification settings - Fork 256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The codebase is deterministic. #161
Comments
So, as you'll have seen the parallelism is all done with very straightforward OpenMP sections, which most C/C++ compilers (MS, Intel, G++) will handle. The crucial thing about 2248 is that it's inside a #pragma critical section, which therefore, is thread-safe - but not thread-predictable - the results can differ based on which thread gets there first. So, this is a race condition, but a very obvious one that we're content with because...
For Intel Inspector, you need to run on a (very) large memory machine so that you can then use the "Locate Deadlocks and Data Races" option with Scope "Extremely thorough". Particularly, this turns off 4-bytes as the default variable size, and uses 1-byte granularity instead. If you don't do that, you get many false positives (e.g, when a pair of 2-byte shorts next to each in memory get safely altered by different threads, but would appear to be a problem if you only considered 4 bytes at a time). Those settings will locate the three examples of the data race above - all within safe critical sections in that part of SetupModel.cpp, and nothing else. It will take several hours to do so, and several hundred gigabytes of RAM, but that pays for an extremely rigorous survey of the code. |
(Look for the text "Memory access byte granularity: 1 byte" in the "Details" section of text, in the Intel Inspector analysis) |
Initially this looked a bit peculiar to me too, but I think I'm with you now. Considering how minimal the effort level is to do it, and the fact that it will not impact performance in a meaningful way, is it not worth just wrapping the reads of ct->S in a critical section the same way that the writes are, if for no other reason than to stop the tools from emitting false positives? |
@tomgreen66 - yes, you're right - ct itself is thread-local on 2204, different places (schools/workplaces), can have people that live in the same cell (2218-2221), hence thread-local ct for different threads can point to the same thing in memory, and ct->S and ct->susceptible can be the same memory on different threads. The critical sections surrounding writes to these keep things safe, but not deterministic in terms of order. But that's not a problem. @cjgdev - The races are genuine, not false positives; the warnings and races still happen with your patch, as thread entrances to the critical sections at 2225 and 2240 will still interleave with the change to ct->S on 2249. So no real benefit then of making the reads critical, unless the writes were in the same critical - which would end up with this section basically becoming single-threaded. Fundamentally, this is just the nature of this part of the algorithm; we are looping through places, with threads grabbing people from cells, to consider assigning to those places. |
Ok yep, that makes sense. I agree that making the whole section single-threaded would not be a step in the right direction. |
Thanks, both, for looking at this, and the constructive comments by the way. |
Happy to help, like @tomgreen66 I'm still ramping up on the code, and more eyes looking doesn't always find new things, but this discussion has been useful. |
Thanks for the detailed discussion. I am a bit of a purist when it comes to OpenMP and like at least an option to have repeatable results when using multiple threads otherwise how would you know a code change elsewhere impacts results with multiple threads. In a previous job OpenMP was even reproducible between single and multi-threaded by avoiding reductions on floating point. Maybe a cppdef could be used to highlight areas of code that are not repeatable when using multiple threads and give the option to remove OpenMP for these parts so at least tests with multiple threads could be run and checked against known results. My limited knowledge of stochastic processes is that it still should abide by some given laws/rules and using thread races to add "noise" is not ideal. Background is that I was looking at setting up some training for our University HPC service on Intel Inspector and spotted OpenMP was used in this code and like to learn approaches researchers take so thought it would be a relevant example for users. |
A couple of thoughts: Not all genres of algorithm can be parallelised while remaining deterministic. The algorithm we are talking about here in the covid-sim is in the "optimisation" genre of algorithm, where the order of steps contributes to the uniqueness of the answer in a way that cannot be factored out or avoided. (Unlike simple floating-point reductions, where a linear step of sorting the array of numbers before summing them will do the trick - at some cost). For algorithms where deterministic parallelisation is not possible, it's common to validate determinism only in single-threaded operation - which is what we do with our regression tests on each commit. Multi-threaded testing is a harder problem, and Intel Inspector - when properly configured - does the job very well, so I can happily recommended it for the training you have in mind. We do a fresh exhaustive test on covid-sim whenever changes are pushed that concern threaded sections. Proper configuration, unfortunately, requires maximum resources and overheads, and anything less than that tends to return a lot of false positives - eg, reported memory leaks that are demonstrably not there in hello-world type examples. Usually though, true errors will be detected as well as the false reports, so if you know what to look for, the cheaper analyses of Inspector are still useful. We could, as you suggest, put a #ifdef around that loop and allow a configuration where a deterministic network can be created, while keeping the rest of the code parallel, but I don't think in practice it would ever get used. That part of the code is only run very rarely to create a first network, and then all runs after save time by loading the cached network. It's important to state that we are not using a thread race to "add noise" - the resulting networks from a single or multi-threaded run are all equally statistically valid, and there are plenty of laws/rules concerning how people are assigned to places. It is appealing to the purist (I include myself as one!) to have everything totally deterministic throughout, but having it take minutes instead of hours to generate different, but statistically equivalent outcomes is an important consideration to weigh up. Finally, it's important to remember that stochastic runs use RNGs; and parallel stochastic runs will have an RNG per thread. So determinism will only hold when comparing runs generated with the same number of threads. |
This is where knowing the code helps so the impact of this can be ignored. I will experiment more but if multiple threads are not repeating results then I feel its missing a regression test for code changes that occur within OpenMP blocks. There is also another one but i assume could be ignored at: Line 2249 in 60da1ec
and Read Line 2226 in 60da1ec
There is also I think the possibility the critical sections could be made atomic to speed things up a bit since the sections are just incrementing values. If the code isnt used much then not a big deal. Its great to see the code and I find it a real learning experience to read code and approaches. Dont get to see much OpenMP code so good to stretch that part of my brain. Thanks for the discussion. |
(EDIT on 11th May): This is a key comment on the issue, listing the single area where thread races can cause different but equivalent stochastic instances of the initialisation network. Single-threading the loop surrounding these lines, or using a previous initialisation file is enough to ensure complete, single or multi-threaded determinism, for given thread count, seed and architecture, of the rest of the codebase. Line numbers refer to 60da1ec Yep, that's basically the same race... there are three in all, with most order permutations occurring in a 32-core run:
#pragma omp atomic needs a bit of careful thought, as it only works with very particular statements. 2171, 2248 and 2251 are potential candidates - but I need to check very carefully the semantics of the atomic. It would be trivial if this were a thread-shared variable being incremented, but I'm not 100% sure whether this will be ok on thread-local variables that point to the same memory from different threads. My gut feeling is that critical will be safer - but I will make a note of this and look into it further; as you say, it might give a little performance improvement. |
I also agree with you on having multi-threaded regression tests; in practice that's quite hard to automate, since to press the races to happen, you need a lot of time and resources, so for now that remains a manual step we do, rather than an automatic regression test. |
Thanks. Just to update. Just did some timings with 2 threads and removed openmp from that section of code and I dont see a performance drop (might actually improve) since data races could also harm performance. I also see in the regression testing no change in the output when run twice with the change so looks like the output isnt that sensitive to multiple threads where it was before. See code changes in omp_branch |
Yes - certainly not linear over threads, and there will be big overheard for small thread counts. With a much larger country and 32 cores (our usual config), the benefits in real time will stand out more. If ever I get time, I will benchmark this some more. But again, remember this is only done once... |
Thanks for taking time to look. Be good to see what impact removing OpenMP on the bigger models.
|
The way we usually handle problems with thread repeatability is implementing a safe parallel thread for statement. Something that can easily be used across the whole code for safe thread access using different multi-threading-safe methods that I'm not very familiar with, but I know it can be done. That way you normalize the use of multi-threading everything is neatly doing it's job. |
This is a very useful discussion. Might I suggest pinning it at the top for more visibility? (GitHub allows three pinned issues). The non-determinism criticisms have gotten a lot of public attention. |
Good suggestion, thank you @chanhosuh, I've done so. |
Due to likely attention on this thread, I am locking it to keep it a tidy reference. |
FURTHER UPDATE: Since misunderstanding and misuse continues with this, PR #272 makes the above section single-threaded, causing no significant change to results - (just a different stochastic instance of network at first initialisation), but determinism throughout the entire codebase. |
Apologies for renaming this thread again, but hope the contributors to this really helpful discussion will understand why. |
Interested to see how parallelism was performed in code and ran the code through the Intel Inspector using the Intel Compiler. It seems a few things are highlighted as race conditions. Mainly in SetupModel. For example:
Read a value from here
covid-sim/src/SetupModel.cpp
Line 2222 in 60da1ec
And decrement it here:
covid-sim/src/SetupModel.cpp
Line 2248 in 60da1ec
ct seems private in the OpenMP loop but since its a pointer I believe the pointer is protected between threads but the data the pointer is pointing too could be the same. Is that correct? I have a few more issues highlighted but wanted to check if I understand it correctly.
The text was updated successfully, but these errors were encountered: