Failing the checksum reproducibility masks testing actual reproducibility #39
Comments
To clarify further, there are (at least) three distinct things we need to test for:
1. run-to-run reproducibility: repeated runs of the same executable and configuration from the same initial conditions give bitwise-identical answers;
2. restart reproducibility: a run split into segments by restarts gives the same answers as an equivalent unbroken run;
3. historical reproducibility: the current configuration still gives the answers recorded in the checksums saved in the configuration repository.
These form a hierarchy - we can't correctly interpret and investigate a failure of test 2 or 3 unless the model has first passed test 1 (and failures of test 1 can happen: we saw reproducible failures of test 1 in early versions of ACCESS-OM3 and occasional failures of test 1 in production runs of ACCESS-OM2).
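For concreteness, here's a rough sketch of what the three tests could look like in a pytest-style suite (all the fixture names here are placeholders, not the existing test code):

```python
# Rough pytest-style sketch only: run_model, restart_model, checksums_of and
# saved_checksums are placeholder fixtures, not the actual test-suite API.

def test_1_repeated_runs_reproduce(run_model, checksums_of):
    # Test 1: two runs of the same executable and configuration give identical answers.
    assert checksums_of(run_model()) == checksums_of(run_model())

def test_2_restarts_reproduce(run_model, restart_model, checksums_of):
    # Test 2: a run split by a restart matches an equivalent unbroken run of the same length.
    assert checksums_of(run_model(days=2)) == checksums_of(restart_model(days=1, segments=2))

def test_3_matches_historical_checksums(run_model, checksums_of, saved_checksums):
    # Test 3: answers are unchanged relative to the checksums saved in the configuration repo.
    assert checksums_of(run_model()) == saved_checksums
```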
The next question is how to test for these three model properties. Of course, checks based on restarts rely on the restarts actually capturing the complete model state, i.e. on test 2 passing.
Thanks to @dougiesquire for pointing out there are two tests marked "checksum_slow", which cover items 1 and 2 in @aekiss' list. The test marked "checksum" covers item 3. Only tests marked "access_om2" (i.e. the QA tests) and those marked "checksum" are run in CI, but maybe we should run the "checksum_slow" tests more often? (Is the compute cost really that high?)
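Assuming these are ordinary pytest markers, running the slow tests more often is just a matter of the `-m` selection expression; a sketch (marker names from this thread, descriptions and invocations are assumptions):

```python
# conftest.py sketch: register the markers discussed above (names taken from the
# comments in this thread; the descriptions are assumptions).
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "checksum: fast historical-checksum comparison (item 3; run in CI)")
    config.addinivalue_line(
        "markers", "checksum_slow: repeat-run and restart repro tests (items 1 and 2)")
    config.addinivalue_line(
        "markers", "access_om2: configuration QA tests (run in CI)")

# CI currently selects roughly:   pytest -m "access_om2 or checksum"
# Adding the slow repro tests:    pytest -m "access_om2 or checksum or checksum_slow"
```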
It's not just compute cost. The tests use the PBS queuing system, which can be slow and unpredictable, so they aren't a great fit for routine CI, where rapid answers are the norm. The counter to this is that it is possible to just cancel a test if you don't need the results, but I'm generally in favour of the default automatic behaviour of these systems being the one that is most commonly used and not requiring human intervention, because human. We also talked about putting in some "on demand" repro testing via comment commands. I think we'll do this; it's just a question of prioritisation.
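The triggering logic for the comment-command idea is trivial; a minimal sketch, assuming a trigger phrase like `!test repro` (the phrase and function name are hypothetical, not something we've agreed on):

```python
# Hypothetical on-demand trigger: decide whether an issue/PR comment asks for repro testing.
TRIGGER = "!test repro"  # assumed command phrase, not yet agreed

def wants_repro_test(comment_body: str) -> bool:
    """True if the comment starts with the trigger phrase (case-insensitive)."""
    return comment_body.strip().lower().startswith(TRIGGER)
```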
I am wondering if it makes sense to run them at low resolution (fairly fast) and skip them at high resolution (on the assumption that the binary is the same at both resolutions). Also, running them only when the historical checksum test fails avoids waiting for them unless it's needed.
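A sketch of the resolution-gating part in pytest (the environment variable and fixtures are placeholders; the "only when the historical test fails" part would need some ordering between tests, so it isn't shown):

```python
import os
import pytest

# Hypothetical: the test harness exports the configuration's resolution, e.g. "1deg" or "025deg".
LOW_RES = os.environ.get("MODEL_RESOLUTION", "1deg") == "1deg"

@pytest.mark.checksum_slow
@pytest.mark.skipif(
    not LOW_RES,
    reason="repro assumed independent of resolution if the binary is the same; run only at low res",
)
def test_repeated_runs_reproduce(run_model, checksums_of):
    # Placeholder fixtures as in the earlier sketch.
    assert checksums_of(run_model()) == checksums_of(run_model())
```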
This was simply to get something in place, because something is better than nothing, and it is what is done in MOM6's own regression testing. But I agree something more robust would be better. We're also not currently testing any CICE output, but we should.
Fair enough. The manifest hashes would cover all model components, so that's another good reason to use them.
However, it's worth noting that the barotropic restarts in ACCESS-OM2 did not have reproducible md5 hashes; since all the other components did reproduce, this was not investigated further. It might just be something like a run timestamp in the barotropic restarts.
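For illustration, a comparison over all components' restart files could be as simple as hashing each file; this is a sketch with made-up paths, not the actual payu manifest machinery:

```python
import hashlib
from pathlib import Path

def md5_of(path: Path) -> str:
    # md5 of a file's contents, read in 1 MiB chunks.
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def restart_hashes(restart_dir: Path) -> dict[str, str]:
    # Map each restart file (relative path) to its md5, covering every model component.
    return {
        str(p.relative_to(restart_dir)): md5_of(p)
        for p in sorted(restart_dir.rglob("*"))
        if p.is_file()
    }

# Two runs reproduce (in this sense) if every component's restarts hash identically, e.g.
#   restart_hashes(Path("run1/restart000")) == restart_hashes(Path("run2/restart000"))
# A per-file dict also makes it obvious which component (e.g. the barotropic restarts) differs.
```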
Thanks to @aekiss for highlighting this.
The current "historical" reproducibility tests check that the current model configuration results are the same as the saved checksum in the configuration repository. There are two issues with this:
ocean.stats
file (i.e. compensating errors would get missed).The text was updated successfully, but these errors were encountered:
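For context, the existing historical check is roughly of this shape: compare statistics parsed from `ocean.stats` against the copy saved in the configuration repository (file handling here is illustrative, not the actual test code):

```python
from pathlib import Path

def stats_lines(path: Path) -> list[str]:
    # Non-empty lines of an ocean.stats-style text file.
    return [line.rstrip() for line in path.read_text().splitlines() if line.strip()]

def matches_saved_checksums(current_stats: Path, saved_stats: Path) -> bool:
    # Historical repro check: the current run's statistics exactly match the saved baseline.
    return stats_lines(current_stats) == stats_lines(saved_stats)

# Because only ocean.stats is compared, changes elsewhere in the model state that happen
# to leave these summary statistics unchanged (compensating errors) would not be caught.
```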