Possible issue with exchange grid?: Zero values sent to atm in grid cells with ifrac = 1 #510

Closed
billsacks opened this issue Oct 1, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@billsacks
Member

I should say up front that I'm not sure whether this is actually a bug with aoflux_grid=xgrid, but it is a difference between aoflux_grid=xgrid and aoflux_grid=ogrid, and it may be contributing to a model crash. So I'm opening this to start a discussion of whether this is a bug.

I'll start with the conclusion before diving into details. With aoflux_grid=ogrid, the atm-ocn flux calculations appear to send 0 values to atm over land points, but NOT over 100% ice points. With aoflux_grid=xgrid, they appear to send 0 values to atm over any points with (lnd+ice)=1 (i.e., any points with ofrac=0). This MAY be contributing to a crash in some configurations with xgrid, though I'm not sure yet whether that's the cause. @jedwards4b @mvertens @uturuncoglu @DeniseWorthen - do any of you have a sense of what's right here?
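
To make the two behaviors concrete, here is a minimal Python sketch of the zero patterns described above. It models only the symptoms reported in this issue; it is not the CMEPS logic. The function name, the tolerance, and the sample fractions are hypothetical, and ofrac is assumed to be 1 - lfrac - ifrac.

```python
def aoflux_sent_as_zero(lfrac, ifrac, grid, tol=1.0e-6):
    """Return True if the atm-ocn flux field would appear as 0 at this cell.

    This encodes the behavior observed in this issue (an assumption),
    not the actual mediator code.
    """
    ofrac = 1.0 - lfrac - ifrac  # assumed definition of the open-ocean fraction
    if grid == "ogrid":
        # Observed: zeros over land points, but not over 100% ice points.
        return lfrac >= 1.0 - tol
    if grid == "xgrid":
        # Observed: zeros wherever (lnd + ice) = 1, i.e. ofrac = 0.
        return ofrac <= tol
    raise ValueError(f"unknown aoflux_grid: {grid!r}")


# Hypothetical cells: (lfrac, ifrac)
cells = {
    "all land": (1.0, 0.0),
    "all ice": (0.0, 1.0),
    "partly open ocean": (0.0, 0.7),
}
for name, (lfrac, ifrac) in cells.items():
    zero_ogrid = aoflux_sent_as_zero(lfrac, ifrac, "ogrid")
    zero_xgrid = aoflux_sent_as_zero(lfrac, ifrac, "xgrid")
    print(f"{name}: ogrid zero={zero_ogrid}, xgrid zero={zero_xgrid}")
```

In this sketch, the only cell where the two settings disagree is the 100% ice cell, which is the difference this issue is about.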

This started with an investigation of a failure in the test SMS_D_Ln9.f09_f09_mg17.FCnudged_GC.derecho_intel.cam-outfrq9s. I'm running this from cesm3_0_alpha03c but with CMEPS updated to cmeps1.0.18 (which is needed to fix some other issues with the exchange grid). Out of the box, this test passes. However, adding aoflux_grid = "xgrid" to user_nl_cpl makes the test fail with a floating divide by zero in drydep_mod:

dec2455.hsn.de.hpc.ucar.edu 484: forrtl: error (73): floating divide by zero
dec2455.hsn.de.hpc.ucar.edu 484: Image              PC                Routine            Line        Source
dec2455.hsn.de.hpc.ucar.edu 484: libpthread-2.31.s  0000149FBD5108C0  Unknown               Unknown  Unknown
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           000000000315AECC  drydep_mod_mp_adu        4108  drydep_mod.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           000000000313FCB0  drydep_mod_mp_dep        1774  drydep_mod.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           000000000311CE7A  drydep_mod_mp_do_         316  drydep_mod.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           0000000002C89ECB  chemistry_mp_chem        3514  chemistry.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           00000000012ACFAC  physpkg_mp_tphysa        1604  physpkg.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           00000000012A7F77  physpkg_mp_phys_r        1284  physpkg.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           00000000009A9FA1  cam_comp_mp_cam_r         290  cam_comp.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           0000000000959FFF  atm_comp_nuopc_mp        1136  atm_comp_nuopc.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6533D86  execute                   377  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6533932  execute                   563  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC653352A  c_esmc_methodtabl         317  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC68D388B  esmf_attachmethod        1287  ESMF_AttachMethods.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC80493FD  Unknown               Unknown  Unknown
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618F279  callVFuncPtr             2167  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618E2B8  ESMCI_FTableCallE         824  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC662BAB2  enter                    2501  ESMCI_VMKernel.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6614346  enter                    1216  ESMCI_VM.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618F65F  c_esmc_ftablecall         981  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6C134FC  esmf_compmod_mp_e        1252  ESMF_Comp.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC74E3D6A  esmf_gridcompmod_        1903  ESMF_GridComp.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC7F85B75  nuopc_driver_mp_r        3694  NUOPC_Driver.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC7F8BDFA  nuopc_driver_mp_e        3940  NUOPC_Driver.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6533D86  execute                   377  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6533932  execute                   563  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC653352A  c_esmc_methodtabl         317  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC68D388B  esmf_attachmethod        1287  ESMF_AttachMethods.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC7F83B76  nuopc_driver_mp_r        3615  NUOPC_Driver.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618F279  callVFuncPtr             2167  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618E2B8  ESMCI_FTableCallE         824  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC662BAB2  enter                    2501  ESMCI_VMKernel.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6614346  enter                    1216  ESMCI_VM.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618F65F  c_esmc_ftablecall         981  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6C134FC  esmf_compmod_mp_e        1252  ESMF_Comp.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC74E3D6A  esmf_gridcompmod_        1903  ESMF_GridComp.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC7F85B75  nuopc_driver_mp_r        3694  NUOPC_Driver.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC7F8BDFA  nuopc_driver_mp_e        3940  NUOPC_Driver.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6533D86  execute                   377  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6533932  execute                   563  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC653352A  c_esmc_methodtabl         317  ESMCI_MethodTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC68D388B  esmf_attachmethod        1287  ESMF_AttachMethods.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC7F83B76  nuopc_driver_mp_r        3615  NUOPC_Driver.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618F279  callVFuncPtr             2167  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618E2B8  ESMCI_FTableCallE         824  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC662BAB2  enter                    2501  ESMCI_VMKernel.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6614346  enter                    1216  ESMCI_VM.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC618F65F  c_esmc_ftablecall         981  ESMCI_FTable.C
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC6C134FC  esmf_compmod_mp_e        1252  ESMF_Comp.F90
dec2455.hsn.de.hpc.ucar.edu 484: libesmf.so         0000149FC74E3D6A  esmf_gridcompmod_        1903  ESMF_GridComp.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           000000000044E467  MAIN__                    141  esmApp.F90
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           0000000000425D7D  Unknown               Unknown  Unknown
dec2455.hsn.de.hpc.ucar.edu 484: libc-2.31.so       0000149FB8E0229D  __libc_start_main     Unknown  Unknown
dec2455.hsn.de.hpc.ucar.edu 484: cesm.exe           0000000000425CAA  Unknown               Unknown  Unknown

The failing line (drydep_mod.F90, line 4108 in the traceback) is:

    ! surface resistance for particle
    RS   = 1.e0_f8 / (E0 * USTAR * (EB + EIM + EIN) * R1 )
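
As an illustration of why a zero input matters here, the following Python sketch transcribes the RS expression and shows that it fails as soon as any factor in the denominator is 0. All numeric values are hypothetical, and ustar is set to 0 only as an example: which term is actually 0 in the failing run has not been confirmed.

```python
def surface_resistance(e0, ustar, eb, eim, ein, r1):
    """Python transcription of the RS line quoted above."""
    return 1.0 / (e0 * ustar * (eb + eim + ein) * r1)


# Hypothetical values, chosen only so that every factor is nonzero.
params = dict(e0=3.0, eb=0.1, eim=0.2, ein=0.05, r1=1.0)

# With a nonzero ustar the expression evaluates normally.
print("ustar = 0.3:", surface_resistance(ustar=0.3, **params))

# With ustar = 0 the denominator is exactly 0 and the division fails,
# analogous to the floating divide by zero trapped in the debug build.
try:
    surface_resistance(ustar=0.0, **params)
except ZeroDivisionError as err:
    print("ustar = 0.0:", err)
```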

As a side note: this test also fails in non-debug mode, though more cryptically (so I'm not positive what's going on there):

dec0891.hsn.de.hpc.ucar.edu 12: MPICH ERROR [Rank 12] [job id 8a79c3d6-fcc1-4b88-bf53-101d7de9bc46] [Mon Sep 30 15:57:48 2024] [dec0891] - Abort(3218063) (rank 12 in comm 0): Fatal error in PMPI_Alltoallv: Other MPI error, error stack:
dec0891.hsn.de.hpc.ucar.edu 12: PMPI_Alltoallv(386)............: MPI_Alltoallv(sbuf=0x14ac2a7e3800, scnts=0x14ac2dcab580, sdispls=0x14ac2dcaab00, dtype=0x4c000829, rbuf=0x14ac2afca840, rcnts=0x14ac2dcaa080, rdispls=0x14ac2dca9600, datatype=dtype=0x4c000829, comm=comm=0xc400000f) failed
dec0891.hsn.de.hpc.ucar.edu 12: MPIR_CRAY_Alltoallv(1187)......:
dec0891.hsn.de.hpc.ucar.edu 12: MPIR_Waitall(167)..............:
dec0891.hsn.de.hpc.ucar.edu 12: MPIR_Waitall_impl(51)..........:
dec0891.hsn.de.hpc.ucar.edu 12: MPID_Progress_wait(201)........:
dec0891.hsn.de.hpc.ucar.edu 12: MPIDI_Progress_test(97)........:
dec0891.hsn.de.hpc.ucar.edu 12: MPIDI_OFI_handle_cq_error(1067): OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Input/output error - PTLTE_NOT_FOUND)
dec0891.hsn.de.hpc.ucar.edu 12:
dec0891.hsn.de.hpc.ucar.edu 12: aborting job:
dec0891.hsn.de.hpc.ucar.edu 12: Fatal error in PMPI_Alltoallv: Other MPI error, error stack:
dec0891.hsn.de.hpc.ucar.edu 12: PMPI_Alltoallv(386)............: MPI_Alltoallv(sbuf=0x14ac2a7e3800, scnts=0x14ac2dcab580, sdispls=0x14ac2dcaab00, dtype=0x4c000829, rbuf=0x14ac2afca840, rcnts=0x14ac2dcaa080, rdispls=0x14ac2dca9600, datatype=dtype=0x4c000829, comm=comm=0xc400000f) failed
dec0891.hsn.de.hpc.ucar.edu 12: MPIR_CRAY_Alltoallv(1187)......:
dec0891.hsn.de.hpc.ucar.edu 12: MPIR_Waitall(167)..............:
dec0891.hsn.de.hpc.ucar.edu 12: MPIR_Waitall_impl(51)..........:
dec0891.hsn.de.hpc.ucar.edu 12: MPID_Progress_wait(201)........:
dec0891.hsn.de.hpc.ucar.edu 12: MPIDI_Progress_test(97)........:
dec0891.hsn.de.hpc.ucar.edu 12: MPIDI_OFI_handle_cq_error(1067): OFI poll failed (ofi_events.c:1069:MPIDI_OFI_handle_cq_error:Input/output error - PTLTE_NOT_FOUND)

This test passes with aoflux_grid = "ogrid", and a similar test without chemistry – SMS_D_Ln9.f09_f09_mg17.FHIST.derecho_intel.cam-outfrq9s – passes even with xgrid.

I haven't dug deeply into the relevant CAM code, but I decided to look for differences in variables sent to the atmosphere in the test SMS_D_Ln9.f09_f09_mg17.FHIST.derecho_intel.cam-outfrq9s with xgrid vs. ogrid. I started by looking at ustar, since that's one of the terms in the line with divide-by-zero (though note that I have not confirmed that this is the term causing the problem).

Here is ustar sent to atm using xgrid:

Pasted image 20240930175002

And here using ogrid:

Pasted image 20240930175021

Both have 0 values over land, but note that the xgrid run has additional 0 values in the Arctic Ocean and near Antarctica. I see these same extra 0 values in Med_aoflux_atm_So_ustar, but not in Med_aoflux_ocn_So_ustar. I spot-checked some other fields from the atm-ocn flux calculation and see the same pattern.

Here is a map showing where ofrac is essentially 0:
Pasted image 20241001163531

And here are points where ifrac is essentially 1:
Pasted image 20241001163604

By eye, these seem to match up very well with the grid cells that have 0 values in the run with aoflux_grid=xgrid. This leads me to the conclusion at the top of the issue.
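
The by-eye comparison could also be done numerically. The sketch below is one way to do that, assuming the field and ofrac have already been read into flat lists of equal length; the function name, the tolerance, and the sample values are hypothetical and are not taken from the runs in this issue.

```python
def compare_zero_cells(field, ofrac, tol=1.0e-6):
    """Count how zero-valued cells in `field` line up with cells where ofrac ~ 0."""
    if len(field) != len(ofrac):
        raise ValueError("field and ofrac must have the same length")
    counts = {
        "zero_and_no_ocean": 0,     # zero value where ofrac ~ 0 (expected overlap)
        "zero_but_ocean": 0,        # zero value despite some open ocean
        "nonzero_but_no_ocean": 0,  # nonzero value although ofrac ~ 0
        "neither": 0,               # nonzero value with some open ocean
    }
    for value, frac in zip(field, ofrac):
        is_zero = value == 0.0
        no_ocean = frac <= tol
        if is_zero and no_ocean:
            counts["zero_and_no_ocean"] += 1
        elif is_zero:
            counts["zero_but_ocean"] += 1
        elif no_ocean:
            counts["nonzero_but_no_ocean"] += 1
        else:
            counts["neither"] += 1
    return counts


# Hypothetical sample: five cells of ustar sent to atm and the matching ofrac.
ustar_to_atm = [0.0, 0.0, 0.25, 0.31, 0.0]
ofrac = [0.0, 0.0, 0.8, 1.0, 1.0e-9]
print(compare_zero_cells(ustar_to_atm, ofrac))
```

If the zeros really coincide with ofrac of essentially 0, the two mismatch counters (zero_but_ocean and nonzero_but_no_ocean) should both be 0.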

@billsacks
Member Author

After some consultation with @mvertens and additional testing, we feel that xgrid is working as intended here. The 0 values over sea ice grid cells also appear in runs with aoflux_grid = "xgrid" but where the atmosphere and ocean are running on different grids. CAM crashes in SMS_D_Ln9.f09_g17.FCnudged_GC.derecho_intel.cam-outfrq9s with aoflux_grid = "ogrid" in the same way as I noted above for SMS_D_Ln9.f09_f09_mg17.FCnudged_GC.derecho_intel.cam-outfrq9s with "xgrid". I believe this is a CAM issue, unless CAM wants to push for changes in the long-standing behavior of the mediator in this respect (which could be a reasonable solution). So I have moved this to a CAM issue:

ESCOMP/CAM#1172

@billsacks closed this as not planned on Oct 14, 2024
@github-project-automation github-project-automation bot moved this from Todo ~ weeks to Done (or no longer holding things up) in CESM: infrastructure / cross-component SE priorities Oct 14, 2024