Incorrect Restart File Names for MOM6 in payu 1.1 #430

Closed
ezhilsabareesh8 opened this issue Mar 22, 2024 · 12 comments · Fixed by #432

@ezhilsabareesh8
Contributor

In the new payu version 1.1, the restart file names for MOM6 are incorrect. This causes MOM6 to look for files with the wrong filenames, leading to warnings such as:

WARNING from PE 0: MOM_restart: Unable to find restart file : ./GMOM_JRA.mom6.r.1900-01-02-00000_1.nc.nc
WARNING from PE 0: MOM_restart: Unable to find restart file : ./GMOM_JRA.mom6.r.1900-01-02-00000_2.nc.nc
WARNING from PE 0: MOM_restart: Unable to find restart file : ./GMOM_JRA.mom6.r.1900-01-02-00000_3.nc.nc
WARNING from PE 0: MOM_restart: Unable to find restart file : ./GMOM_JRA.mom6.r.1900-01-02-00000_4.nc.nc

As seen in the warning messages, the file extension .nc.nc is incorrect and seems to be duplicated, resulting in MOM6 being unable to locate the required restart files.

@minghangli-uni
Contributor

@ezhilsabareesh8 I did a quick check using payu 1.1 but couldn't reproduce your error. Starting fresh with a clean clone of ryf might resolve the issue.

Below is what I received in my access-om3.out:

NOTE from PE 0: MOM_restart: MOM run restarted using : ./access-om3.mom6.r.1900-02-01-00000.nc

@aekiss
Contributor

aekiss commented Mar 26, 2024

@minghangli-uni hits this bug only for 0.25 deg, not 1 deg: COSIMA/access-om3#101 (comment)

Could it be a MOM6 configuration problem in 0.25 deg? Here's a comparison between 1deg and 0.25deg:
ACCESS-NRI/access-om3-configs@1deg_jra55do_ryf...025deg_jra55do_ryf_iss101

@aekiss
Contributor

aekiss commented Mar 26, 2024

@aidanheerdegen
Collaborator

Sounds like something to add to #421 if that is something you need to always be set to a particular value.

@anton-seaice
Collaborator

anton-seaice commented Mar 26, 2024

In this configuration, MOM is producing 5 restart files:

$ cat rpointer.ocn 
access-om3.mom6.r.1900-02-01-00000.nc
access-om3.mom6.r.1900-02-01-00000_1.nc
access-om3.mom6.r.1900-02-01-00000_2.nc
access-om3.mom6.r.1900-02-01-00000_3.nc
access-om3.mom6.r.1900-02-01-00000_4.nc

They are formatted 64-bit offset and have size 3.6GB. I think the maximum size for netcdf 64-bit-offset is 3.6GB, which might be why there are 5 files. (It looks like FMS configs produce multiple restart files too, just they are labelled differently).
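
The on-disk format of a restart file can be confirmed without any netCDF library by reading its leading bytes. The following is a minimal sketch, not part of payu or MOM6; the function name `netcdf_format` is hypothetical, and it assumes the signature sits at the very start of the file, which is the usual case.

```python
from pathlib import Path

# Leading bytes that identify the classic-family netCDF formats.
_MAGIC = {
    b"CDF\x01": "classic",
    b"CDF\x02": "64-bit offset",
    b"CDF\x05": "64-bit data (CDF-5)",
}
# netCDF-4 files are HDF5 files and start with the HDF5 signature.
_HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def netcdf_format(path):
    """Return a short description of the netCDF variant stored at `path`."""
    with Path(path).open("rb") as f:
        header = f.read(8)
    if header == _HDF5_SIGNATURE:
        return "netCDF-4 (HDF5)"
    return _MAGIC.get(header[:4], "unknown")
```

Calling `netcdf_format("access-om3.mom6.r.1900-02-01-00000.nc")` on one of the files above should report "64-bit offset" if the description in this comment is right.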

However, payu (I guess) is not moving the files correctly after a run:

$ ls restart000/
access-om3.cice.r.1900-02-01-00000.nc  access-om3.datm.r.1900-02-01-00000.nc  access-om3.mom6.r.1900-02-01-00000.nc  rpointer.cpl  rpointer.ocn
access-om3.cpl.r.1900-02-01-00000.nc   access-om3.drof.r.1900-02-01-00000.nc  rpointer.atm                           rpointer.ice  rpointer.rof
$ ls output000/access-om3.mom6.*
output000/access-om3.mom6.h.native_1900_01.nc  output000/access-om3.mom6.h.static.nc     output000/access-om3.mom6.r.1900-02-01-00000_1.nc  output000/access-om3.mom6.r.1900-02-01-00000_3.nc
output000/access-om3.mom6.h.sfc_1900_01.nc     output000/access-om3.mom6.h.z_1900_01.nc  output000/access-om3.mom6.r.1900-02-01-00000_2.nc  output000/access-om3.mom6.r.1900-02-01-00000_4.nc

Note how restart files 1 ... 4 are in the output folder, not the restart folder.
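
A quick way to spot this situation is to compare the pointer file against the contents of the restart directory. This is a rough sketch only; `missing_restarts` is a hypothetical helper and not a payu function, and it assumes the pointer file holds one file name per line as in the `rpointer.ocn` listing above.

```python
from pathlib import Path


def missing_restarts(pointer_file, restart_dir):
    """List files named in the pointer file that are absent from restart_dir."""
    restart_dir = Path(restart_dir)
    lines = Path(pointer_file).read_text().splitlines()
    names = [line.strip() for line in lines if line.strip()]
    # Compare on the bare file name so entries like "./name.nc" also work.
    return [n for n in names if not (restart_dir / Path(n).name).is_file()]
```

For the run shown above, `missing_restarts("restart000/rpointer.ocn", "restart000")` would be expected to report the four `_1` to `_4` files.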

Is it possible to configure MOM6 to use netcdf4? If not, I guess a payu update is needed?

p.s. I tested this, and the model starts from the restart if I manually move the four extra _1 ... _4 restart files to the restart000 directory and then run the model.

@anton-seaice
Collaborator

I guess this line should allow multiple lines in the pointer file and iterate over them:

restart = f.readline().rstrip()
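
The idea of iterating over every line could look roughly like the sketch below. This is not the change that was merged in #432, and the function names `read_restart_names` and `move_restarts` are hypothetical; it only assumes that the pointer file lists one restart file per line and that those files sit in a work directory.

```python
import shutil
from pathlib import Path


def read_restart_names(pointer_file):
    """Return every non-empty line of a pointer file, not just the first."""
    with open(pointer_file) as f:
        return [line.strip() for line in f if line.strip()]


def move_restarts(pointer_file, work_dir, restart_dir):
    """Move each restart named in the pointer file from work_dir to restart_dir."""
    restart_dir = Path(restart_dir)
    restart_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in read_restart_names(pointer_file):
        src = Path(work_dir) / name
        if src.is_file():  # skip names that were never written
            shutil.move(str(src), str(restart_dir / src.name))
            moved.append(src.name)
    return moved
```

Reading all lines keeps the single-file case working unchanged, since a one-line pointer file simply yields a list with one entry.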

@dougiesquire
Collaborator

Whoops, I didn't know this happened and so didn't account for multiple restart files. I can fix this up.

@anton-seaice
Collaborator

Thanks Dougie :)

@minghangli-uni
Contributor

p.s. I tested this, and the model starts from the restart if I manually move the four extra _1 ... _4 restart files to the restart000 directory and then run the model.

I can see that MOM can read the restart files after moving the extra files to the restart directory:

NOTE from PE     0: MOM_restart: MOM run restarted using : ./GMOM_JRA.mom6.r.1900-01-02-00000.nc
NOTE from PE     0: MOM_restart: MOM run restarted using : ./GMOM_JRA.mom6.r.1900-01-02-00000_1.nc
NOTE from PE     0: MOM_restart: MOM run restarted using : ./GMOM_JRA.mom6.r.1900-01-02-00000_2.nc

But I received errors in access-om3.err:

get_stripe failed: 61 (No data available)
Abort with message NetCDF: Error initializing for parallel access in file /jobfs/98914803.gadi-pbs/mo1833/spack-stage/spack-stage-parallelio-2.5.10-hyj75i7d5yy5zbqc7jm6whlkduofib2k/spack-src/src/clib/pioc_support.c at line 2832
(the same abort message is repeated nine more times)

@anton-seaice Can you please confirm you don't have such errors?

@anton-seaice
Collaborator

Hi Minghang. That is a bug in openmpi which prevents it from doing a parallel read of files referenced through symlinks. CICE is trying to do a parallel read of ./GMOM_JRA.cice.r.*

We put a patch in the MOM6-CICE6 config (ACCESS-NRI/access-om3-configs#24) whilst waiting for the openmpi 4.1.7 release which will fix this.

You just need to check that the paths in setup_cice_restarts.sh are correct and your config.yaml is still calling it (https://github.com/COSIMA/MOM6-CICE6/blob/c2585c7ddcad8c56d44026835cfd62c2800b645f/config.yaml#L33)
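
For readers unfamiliar with this kind of workaround, one plausible form is to replace the symlinked restart with a real copy before the model starts. The sketch below is an assumption about the general approach only: the contents of setup_cice_restarts.sh are not shown in this thread, that script is a shell script rather than Python, and `replace_symlinks_with_copies` is a hypothetical name.

```python
import shutil
from pathlib import Path


def replace_symlinks_with_copies(directory, pattern):
    """Replace each symlink matching `pattern` in `directory` with a real copy."""
    replaced = []
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_symlink():
            target = path.resolve(strict=True)  # raises if the link is broken
            path.unlink()                       # remove the link itself
            shutil.copy2(target, path)          # put a real file in its place
            replaced.append(path.name)
    return replaced
```

The glob pattern has to match the case name used by the configuration (for example `GMOM_JRA.cice.r.*` rather than `access-om3.cice.r.*`), which is why the paths in the script need checking.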

@minghangli-uni
Contributor

Fixed by substituting access-om3 with GMOM_JRA in setup_cice_restarts.sh.

@dougiesquire
Collaborator

Fixed by substituting access-om3 with GMOM_JRA in setup_cice_restarts.sh.

@minghangli-uni it sounds like you may need to get your configuration up to date with what's on github
