
What CTSM abort looks like for simulations running when /glade/p Forcing softlink is changed to /glade/campaign #2260

Closed
ekluzek opened this issue on Nov 20, 2023 · 8 comments
Assignees: ekluzek
Labels: closed: wontfix (We won't fix this issue, because it would be too difficult and/or isn't important enough to fix)

Comments

ekluzek commented Nov 20, 2023

Brief summary of bug

On Nov/24/2023, we are going to change the softlink that currently points to the DATM forcing data on /glade/p so that it points to where the data now lives on /glade/campaign on Cheyenne. Any CTSM simulation that is running when that change is made will abort with an error. Users will just need to watch their simulations and restart them from the last restart file when this happens.
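For anyone who needs to recover, here is a minimal sketch of restarting from the last restart file with standard CIME commands, assuming the case had already written restart files; the case path is illustrative:

```
# From the directory of the aborted case (path is illustrative)
cd /glade/work/$USER/cases/my_I_case

# Tell CIME to continue from the restart files the run already wrote
./xmlchange CONTINUE_RUN=TRUE

# Resubmit; the run picks up from the latest restart/rpointer files
./case.submit
```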

General bug information

CTSM version you are using: ALL VERSIONS!
Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: All with DATM forcing

Details of bug

This change is only being made on Cheyenne and Derecho, and should not affect other machines; izumi, for example, won't be affected.

Important details of your setup / configuration so we can reproduce the bug

In this case, the abort will happen when we change the softlink so that it points at the data on /glade/campaign instead of /glade/p.
Something similar would happen any time forcing data is moved or removed while a simulation is reading it.
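For context, the kind of softlink change being described could look roughly like the shell sketch below; the link and target paths are illustrative, not the exact paths used on Cheyenne:

```
# Illustrative paths only; the real link lives under the CESM inputdata tree
OLD_LINK=/glade/p/cesmdata/cseg/inputdata/atm/datm7
NEW_TARGET=/glade/campaign/cesm/cesmdata/cseg/inputdata/atm/datm7

# Repoint the softlink at the campaign copy of the data. Jobs that are
# reading forcing during the switch abort with a "file does not exist"
# error (shown below) and need to be restarted afterwards.
ln -sfn "$NEW_TARGET" "$OLD_LINK"
```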

Important output or errors that show the problem

cesm.log:

36:  sysmem size=131653.3 MB rss=253.6 MB share=46.7 MB text=20.4 MB datastack=0.0 MB
36:  sysmem size=131653.3 MB rss=253.6 MB share=46.7 MB text=20.4 MB datastack=0.0 MB
36:  sysmem size=131653.3 MB rss=253.6 MB share=46.7 MB text=20.4 MB datastack=0.0 MB
0: ERROR:
0: (shr_strdata_readstrm) ERROR: file does not exist: /glade/campaign/cesm/cesmdat
0: a/cseg/inputdata/atm/datm7/testlink_tests/atm_forcing.datm7.GSWP3.0.5d.v1.c1705
0: 16/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.1901-11.nc
0:Image              PC                Routine            Line        Source
0:cesm.exe           00000000013E2426  Unknown               Unknown  Unknown
0:cesm.exe           000000000101F190  shr_abort_mod_mp_         114  shr_abort_mod.F90
0:cesm.exe           0000000000FF26CE  dshr_strdata_mod_        1480  dshr_strdata_mod.F90
0:cesm.exe           0000000000FEADEC  dshr_strdata_mod_        1377  dshr_strdata_mod.F90
0:cesm.exe           0000000000FE4F5A  dshr_strdata_mod_         934  dshr_strdata_mod.F90
0:cesm.exe           000000000056797A  atm_comp_nuopc_mp         659  atm_comp_nuopc.F90
0:cesm.exe           000000000056726B  atm_comp_nuopc_mp         544  atm_comp_nuopc.F90

atm.log:

(shr_strdata_readstrm) reading file ub: /glade/campaign/cesm/cesmdata/cseg/inputdata/atm/datm7/testlink_tests/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/Solar/clmforc.GSWP3.c2011.0.5x0.5.Solr.1901-11.nc      59
 atm : model date        11108       10800
 atm : model date        11108       12600
 atm : model date        11108       14400
(shr_strdata_readstrm) reading file ub: /glade/campaign/cesm/cesmdata/cseg/inputdata/atm/datm7/testlink_tests/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/Precip/clmforc.GSWP3.c2011.0.5x0.5.Prec.1901-11.nc      59
(shr_strdata_readstrm) ERROR: file does not exist: /glade/campaign/cesm/cesmdata/cseg/inputdata/atm/datm7/testlink_tests/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.1901-11.nc
 ERROR:
 (shr_strdata_readstrm) ERROR: file does not exist: /glade/campaign/cesm/cesmdat
 a/cseg/inputdata/atm/datm7/testlink_tests/atm_forcing.datm7.GSWP3.0.5d.v1.c1705
 16/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.1901-11.nc

Other log files just end with the last valid message sent to the log file, and don't show an error.

CaseStatus and the run batch log both record the error with something like this...

2023-11-20 12:53:19: case.run error
ERROR: RUN FAIL: Command 'mpiexec_mpt -p "%g:"  -np 1836  omplace -tm open64 -vv /glade/scratch/erik/testlinks/bld/cesm.exe   >> cesm.log.$LID 2>&1 ' failed
See log file for details: /glade/scratch/erik/testlinks/run/cesm.log.4224350.chadmin1.ib0.cheyenne.ucar.edu.231120-124531
 ---------------------------------------------------

ekluzek added the priority: high (High priority to fix/merge soon, e.g., because it is a problem in important configurations) and type: -discussion labels on Nov 20, 2023
ekluzek self-assigned this on Nov 20, 2023
wwieder commented Nov 20, 2023

Erik, should I email the LMWG about this too?

ekluzek commented Nov 20, 2023

@wwieder I'm working on an email to send out. How about I put together a draft and you send it out to the relevant groups?

wwieder commented Nov 20, 2023

Sounds good. Do you just want to post it here or email me? Either way, I'd keep it short and just include a reference to this issue.

ekluzek commented Nov 20, 2023

OK, I'll post a draft here then. I'll mainly just reference this issue and the Google doc that covers this.

ekluzek commented Nov 21, 2023

This is my draft of an email to send out...

to be sent to: tss_staff, tss_visitors, lmwg, ctsm-core?, ctsm-dev...
Subject: Disruption to CTSM "I cases" on Black Friday Nov/24th...
Hello Everyone...
There will be a disruption to CTSM "I cases" running on Cheyenne or Derecho on Black Friday Nov/24th. Cases that are running will abort with an error about missing forcing files.
We are moving forcing data files from /glade/p on Cheyenne to /glade/campaign on this date. When that's done, simulations that are running on either machine will die with an error. You simply need to restart those simulations from the last restart file for the case.
See this issue to see what the error will look like...

#2260

Also see this document for more information on our plans for moving LMWG and TSS files from /glade/p to /glade/campaign:

https://docs.google.com/document/d/1vL2haCsuNGXGPNCBOf-rdV-4M4-bNVQs38uBi5t8eFg

Let us know if you have questions or concerns on this.
Take care, and Happy Thanksgiving to those in the US.

wwieder commented Nov 21, 2023

Thanks Erik, I'll send this out later today.
One question: what about a case I've already run (one that presumably points to DATM data on /glade/p) that I want to continue or re-run after Friday? Can we have softlinks in place so the /glade/p paths still point to the data that's been moved to /glade/campaign?

ekluzek commented Nov 21, 2023

Yes, having softlinks in place so this all works is the plan. I did show that this works in my test case, for example. As we talked about this morning, I'll make sure the GSWP3 and CRUNCEP cases both work when I do the switch Friday.
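For anyone who wants to double-check an existing case before resubmitting after Friday, here is a quick sketch of verifying that an old /glade/p-style path still resolves through the new softlinks; the path is illustrative:

```
# Illustrative /glade/p-style path that an older case's DATM streams might reference
OLD_PATH=/glade/p/cesmdata/cseg/inputdata/atm/datm7

# Show where the path resolves now (it should land under /glade/campaign)
readlink -f "$OLD_PATH"

# Confirm the forcing files are visible through the old path
ls "$OLD_PATH" | head
```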

ekluzek added the closed: wontfix label (We won't fix this issue, because it would be too difficult and/or isn't important enough to fix) and removed the priority: high label (High priority to fix/merge soon, e.g., because it is a problem in important configurations) on Nov 28, 2023
ekluzek commented Nov 28, 2023

Closing as completed, since we've moved over the forcing directories. This issue will remain archived so others can see what similar problems look like in the future.

Also note that I filed an issue in CDEPS for the zero file size problem I saw with NLDAS2 data:

ESCOMP/CDEPS#254

ekluzek closed this as completed on Nov 28, 2023