non-reproducible runs #266

aekiss · 2022-09-09T00:43:27Z

@adele157 is re-running a section of my 01deg_jra55v140_iaf_cycle3 experiment with extra tracers on branch 01deg_jra55v140_iaf_cycle3_antarctic_tracers here
/home/157/akm157/access-om2/01deg_jra55v140_iaf_cycle3_antarctic_tracers/.

Her re-run matches my original up to run 590, but not for run 591 and later.

Note that @adele157 has 2 sets of commits for runs 587-610 on branch 01deg_jra55v140_iaf_cycle3_antarctic_tracers. Ignore the first set - they had the wrong timestep.

Differences in md5 hashes in manifests/restart.yaml indicate bitwise differences in the restarts.
For some reason ocean_barotropic.res.nc md5 hashes never match, but presumably this is harmless if the other restarts match.

Relevant commits (01deg_jra55v140_iaf_cycle3..01deg_jra55v140_iaf_cycle3_antarctic_tracers) are

run 590: git diff -U0 0ab9c24..dcffbd6 manifests/restart.yaml | grep -B 5 md5 | less: md5 differences only in barotropic and tracer-related ocean restarts
run 591: git diff -U0 d09ec5e..942fb38 manifests/restart.yaml | grep -B 5 md5 | less: md5 differences in lots of ocean and ice restarts (presumably all of them?)
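
The same comparison could be scripted rather than read off a `git diff`. This is only a minimal sketch: it assumes each manifests/restart.yaml has already been parsed into a plain dict mapping restart path to md5 string, and the function name `md5_differences`, the sample paths and the sample hashes are all made up for illustration.

```python
def md5_differences(manifest_a, manifest_b, ignore=("ocean_barotropic.res.nc",)):
    """Return the sorted paths whose md5 differs between two manifests.

    manifest_a, manifest_b: dicts mapping restart path -> md5 string
    (assumed to be pre-parsed from manifests/restart.yaml).
    ignore: filename substrings to skip, e.g. the barotropic restarts,
    whose hashes never match in this experiment.
    """
    differing = []
    for path in sorted(set(manifest_a) | set(manifest_b)):
        if any(name in path for name in ignore):
            continue
        # A file present in only one manifest also counts as a difference.
        if manifest_a.get(path) != manifest_b.get(path):
            differing.append(path)
    return differing


# Hypothetical data, not real paths or hashes from the experiment.
original = {
    "work/ocean/INPUT/ocean_temp_salt.res.nc": "aaa",
    "work/ocean/INPUT/ocean_barotropic.res.nc": "bbb",
    "work/ice/RESTART/iced.nc": "ccc",
}
rerun = {
    "work/ocean/INPUT/ocean_temp_salt.res.nc": "aaa",
    "work/ocean/INPUT/ocean_barotropic.res.nc": "zzz",
    "work/ice/RESTART/iced.nc": "ddd",
}
print(md5_differences(original, rerun))  # ['work/ice/RESTART/iced.nc']
```

Passing `ignore=()` would report the barotropic restarts as well.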

So it would seem that something different happened in run 590 so the restarts used by 591 differ.
@adele157 re-ran her 590 and the result was the same as her previous run.
So that seems to indicate something strange happened with my run 590 (0ab9c24).
I can't see anything suspicious for run 590 in my run summary. There are changes to manifests/input.yaml in my runs 588 and 589, but they don't seem relevant.


aekiss · 2022-09-09T00:48:11Z

Restarts, executables and core count are the same. We would expect this to be reproducible, right? Is this what's tested here? https://accessdev.nci.org.au/jenkins/job/ACCESS-OM2/job/reproducibility/

aekiss · 2022-09-09T01:03:05Z

here's the change in my run 588 COSIMA/01deg_jra55_iaf@0f0ec03
run 589 reverses it COSIMA/01deg_jra55_iaf@9546b24
as we can see from this diff between 587 and 589

aekiss · 2022-09-09T01:07:15Z

@adele157 you said

The problem emerges during run 590, there are differences after day 20.

What variables were you comparing? Are the differences undetectable (bitwise) on day 19?

adele-morrison · 2022-09-09T01:11:39Z

I was just looking (pretty coarsely) at daily temperature output. Not sure how to check for bitwise reproducibility from the output, because I think it's had the precision reduced right?

aekiss · 2022-09-09T01:23:48Z

Yep, outputs are 32-bit (single precision) whereas internally and in restarts it's double-precision.

I meant, was the largest absolute difference in the single precision outputs exactly zero on day 19, and nonzero on day 20? Or was there a detectable difference earlier in the run?
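
A rough sketch of that check, assuming the daily output of each run has been read into a list of per-day value lists. The function name and the sample numbers are hypothetical, and real model output would more likely be handled as arrays with land points masked out first.

```python
def first_divergent_day(run_a, run_b):
    """Return (1-based day, max abs difference) for the first day on which
    the two runs differ at all, or None if every compared day is identical."""
    for day, (vals_a, vals_b) in enumerate(zip(run_a, run_b), start=1):
        # Largest absolute pointwise difference on this day.
        max_abs_diff = max((abs(a - b) for a, b in zip(vals_a, vals_b)), default=0.0)
        if max_abs_diff != 0.0:
            return day, max_abs_diff
    return None


# Hypothetical daily values: identical on days 1 and 2, different on day 3.
run_a = [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]]
run_b = [[1.0, 2.0], [1.5, 2.5], [2.0, 3.25]]
print(first_divergent_day(run_a, run_b))  # (3, 0.25)
```

Testing for an exactly zero difference, rather than a tolerance, is deliberate here: the question is whether the outputs are bitwise identical.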

adele-morrison · 2022-09-09T02:08:45Z

I did a more thorough check: There are no differences in the daily averaged output of temperature for the first two days. The difference emerges on day 3 (3rd July 1983) and is present thereafter.

aekiss · 2022-09-09T02:28:26Z

can you show a plot of the difference on day 3?

adele-morrison · 2022-09-09T02:31:30Z

There is only regional output for the new run, so this is the whole domain we have to compare. This is the top ocean level, difference on day 3.

aidanheerdegen · 2022-09-09T06:17:28Z

It is odd.

So if the differences emerge by day 3 of run 590, then it must be in the restarts from run 589 and yet there is no difference except in the barotropic files.

Possibilities include:

differences in something not captured in the manifests
actual differences in the barotropic restarts, but they're masked by there always being differences in the md5 sums
weird random glitch in your run Andrew, that hasn't affected Adele

We can't do much about 1.

For 2 I'd be looking at the differences between the barotropic restart files in /scratch/v45/akm157/access-om2/archive/01deg_jra55v140_iaf_cycle3_antarctic_tracers.

So check that the differences between restart589/ocean/ocean_barotropic.res.nc and restart588/ocean/ocean_barotropic.res.nc are broadly consistent with the differences between restart588/ocean/ocean_barotropic.res.nc and restart587/ocean/ocean_barotropic.res.nc, i.e. that there is no really weird signal or corruption.

You don't have the specific restarts any longer, but you could check out what is available in /g/data/ik11/restarts/access-om2-01/01deg_jra55v140_iaf_cycle3.

So restart587/ocean/ocean_barotropic.res.nc is there. You could check it is consistent with Adele's restart587.

As for 3, well you could try re-running your simulation from your restart587 and see if you can reproduce your own run.

russfiedler · 2022-09-09T06:22:08Z

It's in the ice region. Have the ice restarts been checked too?

aidanheerdegen · 2022-09-09T06:23:18Z

They're covered by the manifests, and don't show differences AFAICT

aekiss · 2022-09-13T23:13:06Z

I've done a reproducibility test starting from /g/data/ik11/restarts/access-om2-01/01deg_jra55v140_iaf_cycle3/restart587 here:
https://github.com/COSIMA/01deg_jra55_iaf/tree/01deg_jra55v140_iaf_cycle3_repro_test
/home/156/aek156/payu/01deg_jra55v140_iaf_cycle3_repro_test

Comparing repro test to my original run (01deg_jra55v140_iaf_cycle3..01deg_jra55v140_iaf_cycle3_repro_test) we get

run 590: git diff -U0 0ab9c24..5072784 manifests/restart.yaml | grep -B 5 md5 | less: no md5 differences (even in barotropic restarts)
run 591: git diff -U0 d09ec5e..eed5041 manifests/restart.yaml | grep -B 5 md5 | less: md5 differences in lots of ocean and ice restarts (presumably all of them?) - so I can't reproduce the restarts from 01deg_jra55v140_iaf_cycle3 run 590

Comparing Adele's run to repro test (01deg_jra55v140_iaf_cycle3_repro_test..01deg_jra55v140_iaf_cycle3_antarctic_tracers) we get

run 590: git diff -U0 5072784..dcffbd6 manifests/restart.yaml | grep -B 5 md5 | less: md5 differences only in barotropic and tracer-related ocean restarts (as expected)
run 591: git diff -U0 eed5041..942fb38 manifests/restart.yaml | grep -B 5 md5 | less: md5 differences only in barotropic and tracer-related ocean restarts - so the non-barotropic restarts from Adele's run 590 are reproducible

So I think we can conclude

- runs are normally reproducible, including barotropic restarts (not sure how Adele's barotropic restarts got altered, but it has no impact on the rest of the model)
- something went wrong in run 590 of 01deg_jra55v140_iaf_cycle3, affecting the rest of cycle 3, cycle 4, and the extension to cycle 4.

aekiss · 2022-09-13T23:19:13Z

Maybe there are clues as to what went wrong in the log files in
/g/data/cj50/access-om2/raw-output/access-om2-01/01deg_jra55v140_iaf_cycle3/output590.

aekiss · 2022-09-14T01:13:00Z

Maybe the env.yaml files differ?

aekiss · 2022-09-14T01:22:19Z

Marshall's comments

- libc change altered a few bits in transcendental functions (e.g. sin function) - was there a libc update?
- also AVX can cause alignment issues (but we use alignment flag so that's probably ruled out)
- debugging requires intensive checksumming

aekiss · 2022-09-14T01:42:22Z

Paul L's comment:

Could be relevant to glibc inconsistencies in transcendental functions: https://stackoverflow.com/questions/71294653/floating-point-inconsistencies-after-upgrading-libc-libm

aekiss · 2022-09-14T02:34:58Z

diff /g/data/cj50/access-om2/raw-output/access-om2-01/01deg_jra55v140_iaf_cycle3/output589/env.yaml /g/data/cj50/access-om2/raw-output/access-om2-01/01deg_jra55v140_iaf_cycle3/output590/env.yaml

shows nothing suspicious, but env.yaml doesn't capture everything

access-hive-bot · 2023-01-12T04:58:58Z

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/how-do-i-start-a-new-perturbation-experiment/262/5

access-hive-bot · 2023-10-10T06:36:09Z

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/inconsistent-ocean-and-sea-ice-in-final-7-5yr-of-0-1-iaf-cycle-4/1492/1

aekiss · 2023-10-11T05:23:19Z

We have a second example of non-reproducibility: the re-run 01deg_jra55v140_iaf_cycle4_rerun_from_2002 was identical to the original 01deg_jra55v140_iaf_cycle4 for about half the run (from April 2002 until 2011-07-01) but then differs from run 962 onwards.

In both the original and re-run, run 962 was part of a continuous sequence of runs with no crashes, Gadi shutdown, or manual intervention such as queue changes, timestep changes, or core count changes.

Ideas for possible causes:

1. ~~gadi system change in original run~~ - ruled out because re-run was identical to original for about 10 years
2. ~~gadi system change in re-run~~ - ruled out by test below
3. random glitch in original run
4. ~~random glitch in re-run~~ - ruled out by test below

We can distinguish 2,3,4 by re-running 961 (starting from restart960, 2011-04-01).

- If restart961 is reproducible, it wasn't a system change in re-run.
- If restart962 is reproducible (i.e. the same as for the re-run 01deg_jra55v140_iaf_cycle4_rerun_from_2002), there was a glitch in original run; otherwise it's a glitch in re-run

The closest available restart is

/g/data/ik11/restarts/access-om2-01/01deg_jra55v140_iaf_cycle4/restart959

so would need to run from 2011-01-01 (run 960) anyway.
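
The decision logic for telling options 2, 3 and 4 apart can be written out as a small function. This is just an illustrative sketch that mirrors the reasoning above; the function name, argument names and return strings are hypothetical.

```python
def diagnose(restart961_reproduced, restart962_matches_rerun):
    """Map the two outcomes of the proposed test onto the candidate causes.

    restart961_reproduced: the test run reproduces restart961, which was
        identical in the original and the re-run.
    restart962_matches_rerun: the test run's restart962 matches the one from
        01deg_jra55v140_iaf_cycle4_rerun_from_2002.
    """
    if not restart961_reproduced:
        # The test cannot even reproduce the part both runs agreed on.
        return "system change in re-run not ruled out (option 2)"
    if restart962_matches_rerun:
        return "glitch in original run (option 3)"
    return "glitch in re-run (option 4)"


print(diagnose(True, True))   # glitch in original run (option 3)
print(diagnose(True, False))  # glitch in re-run (option 4)
```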

access-hive-bot · 2023-10-12T00:22:29Z

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/inconsistent-ocean-and-sea-ice-in-final-7-5yr-of-0-1-iaf-cycle-4/1492/5

aekiss · 2023-10-12T23:34:58Z

I've done a reproducibility test 01deg_jra55v140_iaf_cycle4_repro_test in /home/156/aek156/payu/01deg_jra55v140_iaf_cycle4_repro_test, starting from /g/data/ik11/restarts/access-om2-01/01deg_jra55v140_iaf_cycle4/restart959.

This reproduces both original and rerun initial condition md5 hashes (other than for ocean_barotropic.res.nc.*) in manifests/restart.yaml for runs 960, 961, 962, ruling out a gadi system change in re-run (option 2 above).

For run 963 the manifests/restart.yaml initial condition md5 hashes from 01deg_jra55v140_iaf_cycle4_repro_test match the rerun (01deg_jra55v140_iaf_cycle4_rerun_from_2002), but not the original run (01deg_jra55v140_iaf_cycle4).

Therefore 01deg_jra55v140_iaf_cycle4 had a non-reproducible glitch in run 962.

This is unfortunate - it means we can't regenerate sea ice data to match the ocean state in 01deg_jra55v140_iaf_cycle4 from 2011-07-01 onwards. 01deg_jra55v140_iaf_cycle4_rerun_from_2002 didn't save any ocean data, so if we want ocean data consistent with the ice data we'll have to re-run this and find somewhere to store it (about 6 TB).

It also means there's a known flaw in 01deg_jra55v140_iaf_cycle4 from 2011-07-01 onwards (and the follow-on run 01deg_jra55v140_iaf_cycle4_jra55v150_extension), but I expect (although haven't checked) that the initial glitch was a very small perturbation (e.g. an incorrect value in one variable in one grid cell at one timestep), in which case the ocean data we have would still be credible (a different sample from the same statistical distribution in this turbulent flow). We probably should retain this data despite this flaw, as it has been used in publications. This is an analogous situation to the glitch in 01deg_jra55v140_iaf_cycle3 (see above), which affected all subsequent runs.

aekiss · 2023-10-13T03:16:02Z

sea_level first loses reproducibility on 2011-09-27 simultaneously in both the Arctic and Antarctic (suggesting a CICE or sea ice coupling error). Differences then spread across the globe over the next few days, suggesting a barotropic signal (there may also be some dependence on topography: mid Atlantic ridge apparently shows up on 30 Sept, although not other days). Plot script: https://github.com/aekiss/notebooks/blob/master/01deg_jra55v140_iaf_cycle4_repro_test.ipynb

access-hive-bot · 2023-10-13T03:29:23Z

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/inconsistent-ocean-and-sea-ice-in-final-7-5yr-of-0-1-iaf-cycle-4/1492/8

aekiss · 2023-10-13T04:43:46Z

The SST difference also starts as tiny anomalies in both polar regions on 2011-09-27 and then rapidly becomes global.
Plot script: https://github.com/aekiss/notebooks/blob/master/01deg_jra55v140_iaf_cycle4_repro_test.ipynb

aekiss · 2023-10-13T06:23:48Z

Note that these plots use single-precision output data, so may be unable to detect the very earliest anomalies in the calculation, which uses double precision.
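
As a small illustration of why single-precision output can hide an early anomaly, the sketch below rounds two double-precision values to 32 bits using only the Python standard library. The helper name `to_float32` and the size of the perturbation are arbitrary choices for the example, not values taken from the model.

```python
import struct


def to_float32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


a = 10.0
b = 10.0 + 1e-12  # hypothetical tiny perturbation, well above double-precision resolution

print(a == b)                          # False: the doubles differ
print(to_float32(a) == to_float32(b))  # True: the difference vanishes in single precision
```

Near 10, adjacent single-precision values are about 1e-6 apart, so any difference much smaller than that is invisible in the output even though it is present in the double-precision restarts.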

access-hive-bot · 2023-10-15T23:01:57Z

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/inconsistent-ocean-and-sea-ice-in-final-7-5yr-of-0-1-iaf-cycle-4/1492/12

access-hive-bot · 2023-11-07T00:09:27Z

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/inconsistent-ocean-and-sea-ice-in-final-7-5yr-of-0-1-iaf-cycle-4/1492/16

aekiss added a commit to COSIMA/01deg_jra55_iaf that referenced this issue Sep 13, 2022

set up 01deg_jra55v140_iaf_cycle3_repro_test to test reproducibility,…

8e9e908

… starting from restart587 - see COSIMA/access-om2#266

aekiss changed the title ~~non-reproducible run~~ non-reproducible runs Oct 11, 2023

aekiss added a commit to COSIMA/01deg_jra55_iaf that referenced this issue Oct 12, 2023

set up 01deg_jra55v140_iaf_cycle4_repro_test to test reproducibility,…

8860a8f

… starting from restart959 - see COSIMA/access-om2#266 (comment)

aekiss mentioned this issue Mar 6, 2024

Not reproducible across restarts #281

Open

aekiss mentioned this issue Aug 3, 2024

Failing the checksum reproducability masks testing actual reproducability ACCESS-NRI/model-config-tests#39

Open
