non-reproducible runs #266

Open
aekiss opened this issue Sep 9, 2022 · 28 comments

@aekiss
Contributor

aekiss commented Sep 9, 2022

Moving a private Slack chat here.

@adele157 is re-running a section of my 01deg_jra55v140_iaf_cycle3 experiment with extra tracers on branch 01deg_jra55v140_iaf_cycle3_antarctic_tracers here
/home/157/akm157/access-om2/01deg_jra55v140_iaf_cycle3_antarctic_tracers/.

Her re-run matches my original up to run 590, but not for run 591 and later.

Note that @adele157 has 2 sets of commits for runs 587-610 on branch 01deg_jra55v140_iaf_cycle3_antarctic_tracers. Ignore the first set - they had the wrong timestep.

Differences in md5 hashes in manifests/restart.yaml indicate bitwise differences in the restarts.
For some reason ocean_barotropic.res.nc md5 hashes never match, but presumably this is harmless if the other restarts match.
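The manifest comparison described above can be sketched as follows. This is a minimal illustration, assuming each manifest has already been loaded into a dict mapping file path to md5 string; the real `manifests/restart.yaml` layout is not reproduced here, and `diff_md5`, the paths and the hashes are all hypothetical.

```python
def diff_md5(manifest_a, manifest_b, ignore=("ocean_barotropic.res.nc",)):
    """Return sorted paths whose md5 differs between two manifests.

    Each manifest is assumed to be a dict {path: md5_hex}. Paths containing
    any substring in `ignore` are skipped. A path present in only one
    manifest is also reported as a mismatch.
    """
    mismatched = []
    for path in sorted(set(manifest_a) | set(manifest_b)):
        if any(token in path for token in ignore):
            continue
        if manifest_a.get(path) != manifest_b.get(path):
            mismatched.append(path)
    return mismatched


# Hypothetical paths and hashes, for illustration only.
original = {
    "work/ocean/INPUT/ocean_temp_salt.res.nc": "aaa111",
    "work/ocean/INPUT/ocean_barotropic.res.nc": "bbb222",
    "work/ice/RESTART/iced.nc": "ccc333",
}
rerun = {
    "work/ocean/INPUT/ocean_temp_salt.res.nc": "aaa111",
    "work/ocean/INPUT/ocean_barotropic.res.nc": "zzz999",
    "work/ice/RESTART/iced.nc": "ddd444",
}

print(diff_md5(original, rerun))
```

Skipping the barotropic restart by default mirrors the observation that its hashes never match; pass `ignore=()` to include it.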

Relevant commits (01deg_jra55v140_iaf_cycle3..01deg_jra55v140_iaf_cycle3_antarctic_tracers) are

So it would seem that something different happened in run 590 so the restarts used by 591 differ.
@adele157 re-ran her 590 and the result was the same as her previous run.
So that seems to indicate something strange happened with my run 590 (0ab9c24).
I can't see anything suspicious for run 590 in my run summary. There are changes to manifests/input.yaml in my runs 588 and 589, but they don't seem relevant.

@aekiss
Contributor Author

aekiss commented Sep 9, 2022

Restarts, executables and core count are the same. We would expect this to be reproducible, right? Is this what's tested here? https://accessdev.nci.org.au/jenkins/job/ACCESS-OM2/job/reproducibility/

@aekiss
Contributor Author

aekiss commented Sep 9, 2022

here's the change in my run 588 COSIMA/01deg_jra55_iaf@0f0ec03
run 589 reverses it COSIMA/01deg_jra55_iaf@9546b24
as we can see from this diff between 587 and 589

@aekiss
Contributor Author

aekiss commented Sep 9, 2022

@adele157 you said

The problem emerges during run 590, there are differences after day 20.

What variables were you comparing? Are the differences undetectable (bitwise) on day 19?

@adele-morrison

adele-morrison commented Sep 9, 2022 via email

@aekiss
Contributor Author

aekiss commented Sep 9, 2022

Yep, outputs are 32-bit (single precision) whereas internally and in restarts it's double-precision.

I meant, was the largest absolute difference in the single precision outputs exactly zero on day 19, and nonzero on day 20? Or was there a detectable difference earlier in the run?
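To illustrate why single-precision output can hide a difference that exists in the double-precision model state, here is a small sketch using only the Python standard library. The values are hypothetical and `to_float32` is an illustrative helper, not part of any model tooling.

```python
import struct


def to_float32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


a = 10.0        # hypothetical model value
b = a + 1e-12   # same value with a tiny double-precision perturbation

print(a == b)                          # compares the double-precision values
print(to_float32(a) == to_float32(b))  # compares after rounding to single precision
```

The two doubles differ, but both round to the same single-precision number, so an exactly zero difference in 32-bit output does not prove the internal states were bitwise identical.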

@adele-morrison

adele-morrison commented Sep 9, 2022 via email

@aekiss
Contributor Author

aekiss commented Sep 9, 2022

can you show a plot of the difference on day 3?

@adele-morrison

There is only regional output for the new run, so this is the whole domain we have to compare. This is the top ocean level, difference on day 3.
[Image: Screen Shot 2022-09-09 at 12 29 44 pm, showing the day 3 difference at the top ocean level]

@aidanheerdegen
Contributor

It is odd.

So if the differences emerge by day 3 of run 590, then it must be in the restarts from run 589 and yet there is no difference except in the barotropic files.

Possibilities include:

  1. differences in something not captured in the manifests
  2. actual differences in the barotropic restarts, but they're masked by there always being differences in the md5 sums
  3. weird random glitch in your run Andrew, that hasn't affected Adele

We can't do much about 1.

For 2 I'd be looking at the differences between the barotropic restart files in /scratch/v45/akm157/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers.

So check that the differences between restart589/ocean/ocean_barotropic.res.nc and restart588/ocean/ocean_barotropic.res.nc are broadly consistent with the differences between restart588/ocean/ocean_barotropic.res.nc and restart587/ocean/ocean_barotropic.res.nc, i.e. that there is no really weird signal or corruption.

You don't have the specific restarts any longer, but you could check out what is available in /g/data/ik11/restarts/access-om2-01/01deg_jra55v140_iaf_cycle3.

So restart587/ocean/ocean_barotropic.res.nc is there. You could check it is consistent with Adele's restart587.

As for 3, well you could try re-running your simulation from your restart587 and see if you can reproduce your own run.
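The consistency check on consecutive barotropic restarts suggested for possibility 2 could look roughly like this. It is a sketch only: it assumes the relevant restart field has already been read into flat lists of floats (reading the netCDF files is not shown), and `diff_stats` and the sample values are hypothetical.

```python
import math


def diff_stats(field_a, field_b):
    """Return (max absolute difference, RMS difference) between two fields.

    Both fields are assumed to be equal-length, non-empty sequences of floats.
    """
    if len(field_a) != len(field_b) or not field_a:
        raise ValueError("fields must be non-empty and the same length")
    diffs = [a - b for a, b in zip(field_a, field_b)]
    max_abs = max(abs(d) for d in diffs)
    rms = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return max_abs, rms


# Hypothetical values standing in for one field from three consecutive restarts.
r587 = [0.10, 0.20, 0.30]
r588 = [0.12, 0.18, 0.33]
r589 = [0.11, 0.21, 0.31]

print("589 vs 588:", diff_stats(r589, r588))
print("588 vs 587:", diff_stats(r588, r587))
```

If the two pairs of statistics are of similar magnitude, the change between restarts looks like ordinary model evolution; a pair that is orders of magnitude larger would be worth inspecting for corruption.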

@russfiedler

It's in the ice region. Have the ice restarts been checked too?

@aidanheerdegen
Contributor

They're covered by the manifests, and don't show differences AFAICT

aekiss added a commit to COSIMA/01deg_jra55_iaf that referenced this issue Sep 13, 2022
@aekiss
Contributor Author

aekiss commented Sep 13, 2022

I've done a reproducibility test starting from /g/data/ik11/restarts/access-om2-01/01deg_jra55v140_iaf_cycle3/restart587 here:
https://github.com/COSIMA/01deg_jra55_iaf/tree/01deg_jra55v140_iaf_cycle3_repro_test
/home/156/aek156/payu/01deg_jra55v140_iaf_cycle3_repro_test

Comparing repro test to my original run (01deg_jra55v140_iaf_cycle3..01deg_jra55v140_iaf_cycle3_repro_test) we get

Comparing Adele's run to repro test (01deg_jra55v140_iaf_cycle3_repro_test..01deg_jra55v140_iaf_cycle3_antarctic_tracers) we get

So I think we can conclude

  1. runs are normally reproducible, including barotropic restarts (not sure how Adele's barotropic restarts got altered, but it has no impact on the rest of the model)
  2. something went wrong in run 590 of 01deg_jra55v140_iaf_cycle3, affecting the rest of cycle 3, cycle 4, and the extension to cycle 4.

@aekiss
Contributor Author

aekiss commented Sep 13, 2022

Maybe there are clues as to what went wrong in the log files in
/g/data/cj50/access-om2/raw-output/access-om2-01/01deg_jra55v140_iaf_cycle3/output590.

@aekiss
Contributor Author

aekiss commented Sep 14, 2022

Maybe the env.yaml files differ?

@aekiss
Contributor Author

aekiss commented Sep 14, 2022

Marshall's comments

  • libc change altered a few bits in transcendental functions (e.g. sin function) - was there a libc update?
  • also AVX can cause alignment issues (but we use alignment flag so that's probably ruled out)
  • debugging requires intensive checksumming
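As a rough idea of what "intensive checksumming" means in practice, the sketch below hashes the exact double-precision bytes of a field at every step of two runs and reports the first step at which the hashes differ. It is not the model's own checksum facility; `checksum`, `first_divergence` and the sample data are hypothetical.

```python
import hashlib
import struct


def checksum(values):
    """md5 of the exact IEEE 754 double-precision bytes of a sequence of floats."""
    return hashlib.md5(struct.pack(f"<{len(values)}d", *values)).hexdigest()


def first_divergence(run_a, run_b):
    """Index of the first step whose checksums differ, or None if all match.

    run_a and run_b are sequences of per-step snapshots (lists of floats).
    Only steps present in both runs are compared.
    """
    for step, (field_a, field_b) in enumerate(zip(run_a, run_b)):
        if checksum(field_a) != checksum(field_b):
            return step
    return None


# Hypothetical snapshots: the second run differs by a few bits at step 1.
run_a = [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]]
run_b = [[1.0, 2.0], [1.5, 2.5 + 1e-15], [2.0, 3.0]]

print(first_divergence(run_a, run_b))
```

Hashing raw bytes rather than printed values means even a single-bit change is detected, which is the property needed to localise where two runs first part ways.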

@aekiss
Contributor Author

aekiss commented Sep 14, 2022

Paul L's comment:

Could be relevant to glibc inconsistencies in transcendental functions: https://stackoverflow.com/questions/71294653/floating-point-inconsistencies-after-upgrading-libc-libm

@aekiss
Contributor Author

aekiss commented Sep 14, 2022

diff /g/data/cj50/access-om2/raw-output/access-om2-01/01deg_jra55v140_iaf_cycle3/output589/env.yaml /g/data/cj50/access-om2/raw-output/access-om2-01/01deg_jra55v140_iaf_cycle3/output590/env.yaml

shows nothing suspicious, but env.yaml doesn't capture everything

@access-hive-bot

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/how-do-i-start-a-new-perturbation-experiment/262/5

@access-hive-bot

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/inconsistent-ocean-and-sea-ice-in-final-7-5yr-of-0-1-iaf-cycle-4/1492/1

aekiss changed the title from "non-reproducible run" to "non-reproducible runs" on Oct 11, 2023
@aekiss
Contributor Author

aekiss commented Oct 11, 2023

We have a second example of non-reproducibility: the re-run 01deg_jra55v140_iaf_cycle4_rerun_from_2002 was identical to the original 01deg_jra55v140_iaf_cycle4 for about half the run (from April 2002 until 2011-07-01) but then differs from run 962 onwards.

In both the original and re-run, run 962 was part of a continuous sequence of runs with no crashes, Gadi shutdown, or manual intervention such as queue changes, timestep changes, or core count changes.

Ideas for possible causes:

  1. Gadi system change in original run - ruled out because re-run was identical to original for about 10 years
  2. Gadi system change in re-run - ruled out by test below
  3. random glitch in original run
  4. random glitch in re-run - ruled out by test below

We can distinguish 2, 3 and 4 by re-running 961 (starting from restart960, 2011-04-01).

  • If restart961 is reproducible, it wasn't a system change in re-run.
  • If restart962 is reproducible (i.e. the same as for the re-run 01deg_jra55v140_iaf_cycle4_rerun_from_2002), there was a glitch in original run; otherwise it's a glitch in re-run

The closest available restart is

/g/data/ik11/restarts/access-om2-01/01deg_jra55v140_iaf_cycle4/restart959

so would need to run from 2011-01-01 (run 960) anyway.
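The decision logic in the two bullets above can be written down as a small function. This is one reading of those bullets, not an established procedure; `diagnose` and its argument names are hypothetical, and the "not reproducible" branch is an inference (the bullets only state what a reproducible restart961 rules out).

```python
def diagnose(restart961_reproducible, restart962_matches_rerun):
    """Classify the cause of divergence from the outcome of the repro test.

    restart961_reproducible:  the test reproduces restart961, which the
                              original and re-run agreed on.
    restart962_matches_rerun: the test's restart962 matches the re-run
                              rather than the original.
    """
    if not restart961_reproducible:
        # Inferred: the test cannot reproduce a restart both earlier runs
        # agreed on, which is consistent with a system change (option 2).
        return "consistent with a system change (option 2)"
    if restart962_matches_rerun:
        return "glitch in original run (option 3)"
    return "glitch in re-run (option 4)"


print(diagnose(restart961_reproducible=True, restart962_matches_rerun=True))
```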

@access-hive-bot

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/inconsistent-ocean-and-sea-ice-in-final-7-5yr-of-0-1-iaf-cycle-4/1492/5

aekiss added a commit to COSIMA/01deg_jra55_iaf that referenced this issue Oct 12, 2023
@aekiss
Contributor Author

aekiss commented Oct 12, 2023

I've done a reproducibility test 01deg_jra55v140_iaf_cycle4_repro_test in /home/156/aek156/payu/01deg_jra55v140_iaf_cycle4_repro_test, starting from /g/data/ik11/restarts/access-om2-01/01deg_jra55v140_iaf_cycle4/restart959.

This reproduces both original and rerun initial condition md5 hashes (other than for ocean_barotropic.res.nc.*) in manifests/restart.yaml for runs 960, 961, 962, ruling out a Gadi system change in re-run (option 2 above).

For run 963 the manifests/restart.yaml initial condition md5 hashes from 01deg_jra55v140_iaf_cycle4_repro_test match the rerun (01deg_jra55v140_iaf_cycle4_rerun_from_2002), but not the original run (01deg_jra55v140_iaf_cycle4).

Therefore 01deg_jra55v140_iaf_cycle4 had a non-reproducible glitch in run 962.

This is unfortunate - it means we can't regenerate sea ice data to match the ocean state in 01deg_jra55v140_iaf_cycle4 from 2011-07-01 onwards. 01deg_jra55v140_iaf_cycle4_rerun_from_2002 didn't save any ocean data, so if we want ocean data consistent with the ice data we'll have to re-run this and find somewhere to store it (about 6 TB).

It also means there's a known flaw in 01deg_jra55v140_iaf_cycle4 from 2011-07-01 onwards (and the follow-on run 01deg_jra55v140_iaf_cycle4_jra55v150_extension), but I expect (although haven't checked) that the initial glitch was a very small perturbation (e.g. an incorrect value in one variable in one grid cell at one timestep), in which case the ocean data we have would still be credible (a different sample from the same statistical distribution in this turbulent flow). We probably should retain this data despite this flaw, as it has been used in publications. This is an analogous situation to the glitch in 01deg_jra55v140_iaf_cycle3 (see above), which affected all subsequent runs.

@aekiss
Contributor Author

aekiss commented Oct 13, 2023

sea_level first loses reproducibility on 2011-09-27 simultaneously in both the Arctic and Antarctic (suggesting a CICE or sea ice coupling error). Differences then spread across the globe over the next few days, suggesting a barotropic signal (there may also be some dependence on topography: the Mid-Atlantic Ridge apparently shows up on 30 Sept, although not on other days). Plot script: https://github.com/aekiss/notebooks/blob/master/01deg_jra55v140_iaf_cycle4_repro_test.ipynb

[Image: plot of sea_level differences]
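Finding the first day on which a daily output field loses reproducibility amounts to scanning for the first nonzero maximum absolute difference. The sketch below is not the linked notebook; it assumes the daily fields have been flattened into lists of floats, and the function names, day labels and values are hypothetical.

```python
def max_abs_diff_by_day(dates, run_a, run_b):
    """Map each date to the largest absolute pointwise difference on that day."""
    return {
        date: max((abs(x - y) for x, y in zip(snap_a, snap_b)), default=0.0)
        for date, snap_a, snap_b in zip(dates, run_a, run_b)
    }


def first_divergent_day(dates, run_a, run_b):
    """Return the first date with a nonzero difference, or None if none differ."""
    for date, diff in max_abs_diff_by_day(dates, run_a, run_b).items():
        if diff > 0.0:
            return date
    return None


# Hypothetical daily snapshots of one field from two runs.
dates = ["day 1", "day 2", "day 3"]
original = [[0.10, 0.20], [0.11, 0.21], [0.12, 0.22]]
rerun = [[0.10, 0.20], [0.11, 0.25], [0.12, 0.30]]

print(first_divergent_day(dates, original, rerun))
```

Because the comparison uses output data, the date it reports is an upper bound on when the runs actually diverged if that output is stored in single precision.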

@access-hive-bot

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/inconsistent-ocean-and-sea-ice-in-final-7-5yr-of-0-1-iaf-cycle-4/1492/8

@aekiss
Contributor Author

aekiss commented Oct 13, 2023

The SST difference also starts as tiny anomalies in both polar regions on 2011-09-27 and then rapidly becomes global.
Plot script: https://github.com/aekiss/notebooks/blob/master/01deg_jra55v140_iaf_cycle4_repro_test.ipynb

[Image: plot of SST differences]

@aekiss
Contributor Author

aekiss commented Oct 13, 2023

Note that these plots use single-precision output data, so may be unable to detect the very earliest anomalies in the calculation, which uses double precision.

@access-hive-bot

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/inconsistent-ocean-and-sea-ice-in-final-7-5yr-of-0-1-iaf-cycle-4/1492/12

@access-hive-bot

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/inconsistent-ocean-and-sea-ice-in-final-7-5yr-of-0-1-iaf-cycle-4/1492/16


5 participants