non-reproducible runs #266
Restarts, executables and core count are the same. We would expect this to be reproducible, right? Is this what's tested here? https://accessdev.nci.org.au/jenkins/job/ACCESS-OM2/job/reproducibility/
Here's the change in my run 588: COSIMA/01deg_jra55_iaf@0f0ec03
@adele157 you said "The problem emerges during run 590, there are differences after day 20." What variables were you comparing? Are the differences undetectable (bitwise) on day 19?
I was just looking (pretty coarsely) at daily temperature output. Not sure how to check for bitwise reproducibility from the output, because I think it's had the precision reduced, right?
Yep, outputs are 32-bit (single precision) whereas internally and in restarts it's double precision. I meant, was the largest absolute difference in the single-precision outputs exactly zero on day 19, and nonzero on day 20? Or was there a detectable difference earlier in the run?
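One way to do this kind of check from the saved output is to load the same daily field from both runs and look at the largest absolute difference per day. A minimal sketch using xarray; the file paths and the variable name temp are placeholders, not the actual experiment layout:

```python
# Compare a daily-mean field from two runs and report the largest absolute
# difference per day. Paths and the variable name are hypothetical examples.
import xarray as xr

run_a = xr.open_dataset("run_orig/output590/ocean/ocean_daily.nc")
run_b = xr.open_dataset("run_rerun/output590/ocean/ocean_daily.nc")

# Absolute difference of the (single-precision) saved field.
diff = abs(run_a["temp"] - run_b["temp"])

# Maximum over all non-time dimensions, leaving one value per day.
max_per_day = diff.max(dim=[d for d in diff.dims if d != "time"])
for day, value in zip(max_per_day["time"].values, max_per_day.values):
    print(day, value)  # 0.0 means no detectable difference in the saved output
```

A value of exactly zero here only shows that the single-precision output matches; the double-precision restarts could still differ below that resolution.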
I did a more thorough check: there are no differences in the daily averaged output of temperature for the first two days. The difference emerges on day 3 (3rd July 1983) and is present thereafter.
Can you show a plot of the difference on day 3?
It is odd. So if the differences emerge by day 3 of run 590, then it must be in the restarts from run 589, and yet there is no difference except in the barotropic files. Possibilities include: […]
We can't do much about 1. For 2 I'd be looking at the differences between the barotropic restart files in […]. So check differences between […]. You don't have the specific restarts any longer, but you could check out what is available in […]. So […]. As for 3, well, you could try re-running your simulation from your restart587 and see if you can reproduce your own run.
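For the restart comparison, one rough way to see which variables differ and by how much is to open the two restart files and take the maximum absolute difference per variable; a tool like nccmp would do the same job. A sketch, with hypothetical file names:

```python
# Report, for each variable, the largest absolute difference between two
# versions of a restart file. File names are hypothetical examples.
import numpy as np
import xarray as xr

a = xr.open_dataset("restart589_orig/ocean_barotropic.res.nc")
b = xr.open_dataset("restart589_rerun/ocean_barotropic.res.nc")

for name in a.data_vars:
    if name not in b.data_vars:
        print(f"{name}: missing from second file")
        continue
    if not np.issubdtype(a[name].dtype, np.number):
        continue  # skip non-numeric variables
    diff = np.abs(a[name].values - b[name].values)
    print(f"{name}: max abs diff = {np.nanmax(diff)}")
```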
It's in the ice region. Have the ice restarts been checked too?
They're covered by the manifests, and don't show differences AFAICT
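For anyone repeating this kind of check without relying on the manifests, hashing the files directly gives the same answer. A small sketch; the directory names are made up for illustration:

```python
# Compare MD5 hashes of identically named files in two restart directories.
# Directory names are hypothetical examples.
import hashlib
from pathlib import Path

def md5sum(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

dir_a = Path("archive_orig/restart589")
dir_b = Path("archive_rerun/restart589")

for file_a in sorted(dir_a.rglob("*.nc")):
    rel = file_a.relative_to(dir_a)
    file_b = dir_b / rel
    if not file_b.exists():
        print(f"{rel}: missing from {dir_b}")
        continue
    same = md5sum(file_a) == md5sum(file_b)
    print(f"{rel}: {'identical' if same else 'DIFFERS'}")
```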
… starting from restart587 - see COSIMA/access-om2#266
I've done a reproducibility test starting from […].
Comparing repro test to my original run ([…]): […]
Comparing Adele's run to repro test ([…]): […]
So I think we can conclude […]
Maybe there are clues as to what went wrong in the log files in […]
Maybe the […]
Marshall's comments: […]
Paul L's comment: […]
[…] shows nothing suspicious, but […]
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there: https://forum.access-hive.org.au/t/how-do-i-start-a-new-perturbation-experiment/262/5
We have a second example of non-reproducibility: the re-run […]. In both the original and re-run, run 962 was part of a continuous sequence of runs with no crashes, Gadi shutdown, or manual intervention such as queue changes, timestep changes, or core count changes. Ideas for possible causes: […]
We can distinguish 2, 3 and 4 by re-running 961 (starting from restart960, 2011-04-01). The closest available restart is […], so we would need to run from 2011-01-01 (run 960) anyway.
… starting from restart959 - see COSIMA/access-om2#266 (comment)
I've done a reproducibility test […]. This reproduces both original and rerun initial condition md5 hashes (other than for […]). For run 963 the […]. Therefore […].
This is unfortunate - it means we can't regenerate sea ice data to match the ocean state in […]. It also means there's a known flaw in […].
The SST difference also starts as tiny anomalies in both polar regions on 2011-09-27 and then rapidly becomes global. |
Note that these plots use single-precision output data, so they may be unable to detect the very earliest anomalies in the calculation, which uses double precision.
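To illustrate the point with a toy numpy example (not model data): a difference well below single-precision resolution vanishes entirely once both values are rounded to float32 for output.

```python
# Toy illustration: a tiny double-precision difference disappears when the
# fields are written out in single precision.
import numpy as np

a = np.float64(271.35)   # e.g. an SST value from one run
b = a + 1e-13            # the same value perturbed at the 1e-13 level

print(b - a)                          # nonzero in double precision
print(np.float32(b) - np.float32(a))  # 0.0 after rounding to single precision
```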
Moving a private Slack chat here.
@adele157 is re-running a section of my 01deg_jra55v140_iaf_cycle3 experiment with extra tracers on branch 01deg_jra55v140_iaf_cycle3_antarctic_tracers here: /home/157/akm157/access-om2/01deg_jra55v140_iaf_cycle3_antarctic_tracers/. Her re-run matches my original up to run 590, but not for run 591 and later.
Note that @adele157 has 2 sets of commits for runs 587-610 on branch 01deg_jra55v140_iaf_cycle3_antarctic_tracers. Ignore the first set - they had the wrong timestep.
Differences in md5 hashes in manifests/restart.yaml indicate bitwise differences in the restarts. For some reason ocean_barotropic.res.nc md5 hashes never match, but presumably this is harmless if the other restarts match.
Relevant commits (01deg_jra55v140_iaf_cycle3..01deg_jra55v140_iaf_cycle3_antarctic_tracers) are:
git diff -U0 0ab9c24..dcffbd6 manifests/restart.yaml | grep -B 5 md5 | less : md5 differences only in barotropic and tracer-related ocean restarts
git diff -U0 d09ec5e..942fb38 manifests/restart.yaml | grep -B 5 md5 | less : md5 differences in lots of ocean and ice restarts (presumably all of them?)
So it would seem that something different happened in run 590, so the restarts used by 591 differ.
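A side note for anyone repeating this comparison: the same information can be pulled out programmatically by taking the two versions of manifests/restart.yaml (e.g. dumped with git show) and listing the files whose recorded md5 changed. A rough sketch; the manifest layout assumed here (path mapping to a hashes dict containing md5) is an assumption, not a statement of the actual payu manifest schema:

```python
# List restart files whose recorded md5 differs between two versions of
# manifests/restart.yaml. The manifest layout assumed here
# (path -> {"hashes": {"md5": ...}}) is a guess, not the documented schema.
import yaml

def md5_map(manifest_path):
    with open(manifest_path) as f:
        docs = [d for d in yaml.safe_load_all(f) if d]
    hashes = {}
    for doc in docs:
        for path, entry in doc.items():
            if isinstance(entry, dict):
                md5 = entry.get("hashes", {}).get("md5")
                if md5:
                    hashes[path] = md5
    return hashes

old = md5_map("restart_old.yaml")  # e.g. the file as of one commit
new = md5_map("restart_new.yaml")  # e.g. the file as of a later commit

for path in sorted(set(old) | set(new)):
    if old.get(path) != new.get(path):
        print(path)
```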
@adele157 re-ran her 590 and the result was the same as her previous run. So that seems to indicate something strange happened with my run 590 (0ab9c24). I can't see anything suspicious for run 590 in my run summary. There are changes to manifests/input.yaml in my runs 588 and 589, but they don't seem relevant.