Failing the checksum reproducibility test masks testing actual reproducibility #39

Open
anton-seaice opened this issue Aug 2, 2024 · 8 comments

@anton-seaice

Thanks to @aekiss for highlighting this.

The current "historical" reproducibility tests check that the current model configuration results are the same as the saved checksum in the configuration repository. There are two issues with this:

  1. its not very strict only looking at the ocean.stats file (i.e. compensating errors would get missed).
  2. when it fails (which is often, i.e. every time the configuration is changed), it doesn't test the new configuration is reproduceable with itself. To confirm reproducibility, we need to run the configuration under test twice (under identical conditions).
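A minimal sketch of the kind of self-reproducibility check meant in item 2, assuming a hypothetical `run_configuration` fixture that runs the configuration under identical conditions and returns the path of its checksum/stats output:

```python
# Hypothetical sketch: run the same configuration twice and compare checksums.
# `run_configuration` is an assumed helper/fixture, not part of any existing suite.
from pathlib import Path


def read_checksums(stats_file: Path) -> list[str]:
    """Return the non-empty lines of a model statistics/checksum file."""
    return [line for line in stats_file.read_text().splitlines() if line.strip()]


def test_configuration_reproduces_itself(run_configuration):
    """Two runs of the same configuration under identical conditions must match."""
    first = read_checksums(run_configuration(run_id="repro-1"))
    second = read_checksums(run_configuration(run_id="repro-2"))
    assert first == second, "two identical runs produced different checksums"
```
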
@aekiss (Contributor) commented Aug 3, 2024

To clarify further, there are (at least) three distinct things we need to test for:

  1. determinism - whether the same model produces identical results when run under identical conditions (including the same length of run)
  2. integrity of restarts - whether an experiment produces identical results regardless of whether it is conducted as a single model run or several, i.e. whether restarts precisely capture the complete model state
  3. model regressions - whether a change in the model code or configuration has unexpectedly changed the results

These form a hierarchy - we can't correctly interpret and investigate a failure of test 2 or 3 unless the model has first passed test 1 (and failures of test 1 can happen: we saw reproducible failures of test 1 in early versions of ACCESS-OM3 and occasional failures of test 1 in production runs of ACCESS-OM2).
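One way this hierarchy could be expressed in a pytest suite is sketched below. `run_model`, `run_model_segmented` and `reference_checksums` are hypothetical fixtures, and tests 2 and 3 skip themselves when test 1 has not passed:

```python
import pytest

_determinism_passed = False


def test_1_determinism(run_model):
    """1. Same code, same configuration, same run length -> identical results."""
    global _determinism_passed
    assert run_model(days=10) == run_model(days=10)
    _determinism_passed = True


def test_2_restart_integrity(run_model, run_model_segmented):
    """2. One 10-day run matches two chained 5-day runs."""
    if not _determinism_passed:
        pytest.skip("determinism (test 1) did not pass, so this result is uninterpretable")
    assert run_model(days=10) == run_model_segmented(days=[5, 5])


def test_3_regression(run_model, reference_checksums):
    """3. Results still match the checksums stored in the configuration repository."""
    if not _determinism_passed:
        pytest.skip("determinism (test 1) did not pass, so this result is uninterpretable")
    assert run_model(days=10) == reference_checksums
```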

@aekiss (Contributor) commented Aug 3, 2024

The next question is how to test for these three model properties. I wonder why ocean.stats is being used, rather than the restart md5 hashes from the payu manifest?

Checksums and ocean.stats cannot give a watertight guarantee, because there may be roundoff in ocean.stats or compensation such as alternating errors which leave a checksum unchanged. Restart manifest md5 hashes offer a much stronger guarantee, essentially as good as a binary diff on the restarts, but much faster (in principle they may suffer the same cancellation problem, or be incorrect due to the payu logic that decides whether to calculate the md5 hash based on differences in the binhash, but this is vanishingly unlikely).

Of course, checks based on restarts rely on the restarts actually capturing the complete model state, i.e. test 2 passing.
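A rough sketch of what a manifest-based comparison could look like, assuming the payu restart manifest is a YAML document mapping restart file paths to per-file hash dictionaries (the exact layout may differ between payu versions, so treat this as illustrative):

```python
from pathlib import Path

import yaml


def restart_md5s(manifest_path: Path) -> dict[str, str]:
    """Map each restart file path in a payu manifest to its md5 hash."""
    md5s = {}
    for doc in yaml.safe_load_all(manifest_path.read_text()):
        if not isinstance(doc, dict):
            continue
        for path, entry in doc.items():
            hashes = entry.get("hashes", {}) if isinstance(entry, dict) else {}
            if "md5" in hashes:
                md5s[path] = hashes["md5"]
    return md5s


def test_restart_hashes_reproduce(manifest_a: Path, manifest_b: Path):
    """Two runs under identical conditions should produce identical restart md5s."""
    assert restart_md5s(manifest_a) == restart_md5s(manifest_b)
```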

@anton-seaice (Author)

Thanks to @dougiesquire for pointing out there are two tests marked "checksum_slow", which cover items 1. and 2. in @aekiss' list.

The test marked "checksum" covers item 3.

Only tests marked "access_om2" (i.e. the QA tests) and those marked "checksum" are run in CI, but maybe we should run the "checksum_slow" tests more often? (Is the compute cost really that high?)

@aidanheerdegen (Member) commented Aug 4, 2024

> maybe we should run the "checksum_slow" tests more often? (is the compute cost really that high?)

It's not just compute cost. The tests use the PBS queuing system, which can be slow and unpredictable, so it isn't a great fit for routine CI tests where rapid answers are the norm. The counter to this is that it is possible to just cancel a test if you don't need the results, but I'm generally in favour of the default automatic behaviour of these systems being the one that is most commonly used, without requiring human intervention, because human.

We also talked about putting in some "on demand" repro testing via comment commands. I think we'll do this, just a question of prioritisation.

@anton-seaice (Author)

I am wondering if it makes sense to run them at low resolution (fairly fast) and skip them at high resolution (on the assumption that the binary is the same at both resolutions). Also, running them only when the historical checksum test fails avoids waiting for the slow tests unless they are needed (a rough sketch of that strategy is below).
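Something along these lines (purely illustrative, using the marker names mentioned in this thread) would keep the default CI path fast and only pay for the slow tests when the historical checksums change:

```python
import subprocess

# Fast path on every push: QA tests plus the historical checksum comparison.
fast = subprocess.run(["pytest", "-m", "access_om2 or checksum"])

if fast.returncode != 0:
    # The historical checksums no longer match, so before accepting new ones,
    # check that the changed configuration at least reproduces itself.
    subprocess.run(["pytest", "-m", "checksum_slow"], check=True)
```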

@dougiesquire (Collaborator)

> I wonder why ocean.stats is being used, rather than the restart md5 hashes from the payu manifest?

This was simply to get something in place, because something is better than nothing, and comparing ocean.stats is what is done in MOM6's own regression testing. But I agree something more robust would be better. We're also not currently testing any CICE output, but we should.
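For context, the existing comparison amounts to something like the sketch below (file locations are illustrative only, not the actual repository layout):

```python
from pathlib import Path


def test_ocean_stats_matches_reference(run_dir: Path, config_dir: Path):
    """Compare the run's ocean.stats line by line against a checked-in reference."""
    current = (run_dir / "ocean.stats").read_text().splitlines()
    reference = (config_dir / "testing" / "ocean.stats").read_text().splitlines()
    assert current == reference
```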

@aekiss (Contributor) commented Aug 4, 2024

Fair enough. The manifest hashes would cover all model components, so that's another good reason to use them.

@aekiss (Contributor) commented Aug 4, 2024

However, it's worth noting that the barotropic restarts in ACCESS-OM2 did not have reproducible md5 hashes; since all the other components did reproduce, this was not investigated further. It might just be something like a run timestamp in the barotropic restarts.
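A quick way to check whether only metadata differs (rather than field values) would be something like the following xarray sketch; the barotropic restart file names here are illustrative:

```python
import xarray as xr


def restart_data_matches(file_a: str, file_b: str) -> bool:
    """True if two restart files hold identical variable values (attributes are ignored)."""
    with xr.open_dataset(file_a, decode_times=False) as a, \
            xr.open_dataset(file_b, decode_times=False) as b:
        try:
            xr.testing.assert_equal(a, b)  # compares dims, coords and values, not attrs
        except AssertionError:
            return False
    return True


print(restart_data_matches("run1/ocean_barotropic.res.nc", "run2/ocean_barotropic.res.nc"))
```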

@CodeGat self-assigned this Sep 5, 2024