Add fixes to products for when REPLAY IC's are used #2755

EricSinsky-NOAA · 2024-07-10T17:38:47Z

Description

This PR fixes a couple of issues that arise when replay initial conditions (ICs) are used. These issues occur only when REPLAY_ICS is set to YES and OFFSET_START_HOUR is greater than 0. This PR addresses the following items:

Fix the issue that prevents ocean_prod tasks from being triggered (issue #2725). A new diag_table, diag_table_replay, has been added and is used only when REPLAY_ICS is set to YES. This diag_table accounts for the offset that occurs when replay ICs are used.
Fix the issue that prevents atmos_prod tasks from being triggered for the first lead time (e.g. f003) (issue #2754). When OFFSET_START_HOUR is greater than 0, the first fhr is ${OFFSET_START_HOUR}+(${DELTIM}/3600). This value is defined in forecast_predet.sh and allows data for the first lead time to be generated. The filename for this lead time is still labelled with OFFSET_START_HOUR.
Make minor modifications to the extractvars task so that atmos data from replay cases can be processed.
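
As a rough illustration of the first-fhr arithmetic described above (not the code in forecast_predet.sh), the sketch below computes ${OFFSET_START_HOUR}+(${DELTIM}/3600) and the label that the filename keeps. The values 3 and 600 and the variable names first_fhr and label are assumptions chosen for the example.

```bash
#!/usr/bin/env bash
# Hypothetical values for illustration; real runs take these from the experiment config.
OFFSET_START_HOUR=3   # hours between the cycle time and the model start
DELTIM=600            # model time step in seconds

# Bash arithmetic is integer-only, so awk performs the fractional division.
first_fhr=$(awk -v off="${OFFSET_START_HOUR}" -v dt="${DELTIM}" \
  'BEGIN { printf "%.4f", off + dt / 3600 }')

# The file label still uses the integer offset hour, zero-padded to three digits.
label=$(printf "f%03d" "${OFFSET_START_HOUR}")

echo "first fhr = ${first_fhr}, labelled ${label}"
```

With these assumed inputs the first forecast hour is fractional while the label stays at the offset hour, which is why the first lead time can still be matched by the product tasks.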

This PR was split from PR #2680.

Refs #2725, #2754

Type of change

Bug fix (fixes something broken)

Change characteristics

Is this a breaking change (a change in existing functionality)? NO
Does this change require a documentation update? NO

How has this been tested?

These changes were cloned, built, and run on WCOSS2, and were tested by running GEFS.

Checklist

Any dependent changes have been merged and published
My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
My changes generate no new warnings
New and existing tests pass with my changes
I have made corresponding changes to the documentation if necessary

aerorahul

See some suggestions.

ush/parsing_namelists_FV3.sh

ush/forecast_postdet.sh

christopherwharrop-noaa · 2024-08-09T15:10:08Z

Rocoto has very little control over hanging batch system commands. If a batch system is being hammered, and qstat starts taking a long time to run, all Rocoto can do is time out the commands and try again later.

I will say that PBSPro is highly prone to this type of behavior as it has had scaling/threading problems in the past. Which is one reason why I was very surprised to see PBSPro selected as the batch system for WCOSS. And it is very easy to overload PBSPro with naive user behaviors. Rocoto has already been highly tuned to do everything possible to put as little load as possible on PBSPro as problems were first seen and addressed on Cheyenne.
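
The "time out and try again later" behaviour can be pictured with a small wrapper. This is only a sketch of the general pattern, not Rocoto's implementation: run_with_timeout is a hypothetical helper, and sleep stands in for a hung batch query such as qstat.

```bash
#!/usr/bin/env bash
# run_with_timeout SECONDS CMD [ARGS...]
# Runs CMD under coreutils `timeout`; the exit status is 124 when the limit
# was hit, so the caller can give up on this pass and retry on the next one.
run_with_timeout() {
  local limit=$1
  shift
  timeout "${limit}" "$@"
}

# `sleep 3` stands in for a batch query that hangs longer than the limit.
rc=0
run_with_timeout 1 sleep 3 || rc=$?

if [[ ${rc} -eq 124 ]]; then
  echo "query timed out; will retry on the next pass"
fi
```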

christopherwharrop-noaa · 2024-08-09T16:04:18Z

If you (or others) have other processes that run qstat commands on WCOSS, you may want to audit those to find out what they are doing. PBSPro is very fragile, and someone running qstat or, even worse, pbsnodes, in the wrong way in automation could very easily cause problems.

TerrenceMcGuinness-NOAA · 2024-08-12T13:22:52Z

The enkfgdascleanup job is PENDING in the queue for the C96C48_hybatmDA case on Hera.
I am not sure what "launch failed requeued held" means.
The case is not flagged as STALLED because Rocoto still reports the job as QUEUED while Slurm reports it as PENDING.

Terry.McGuinness (hfe03) C96C48_hybatmDA_57107f48 $ squeue -u $USER
     JOBID PARTITION  NAME                     USER             STATE        TIME TIME_LIMIT NODES NODELIST(REASON)
  64649768 hera       C96C48_hybatmDA_57107f48 Terry.McGuinness PENDING      0:00      15:00     1 (launch failed requeued held)
  64649463 hera       C96_atmaerosnowDA_57107f Terry.McGuinness PENDING      0:00      30:00    10 (launch failed requeued held)
  
Terry.McGuinness (hfe03) C96C48_hybatmDA_57107f48 $ rocotostat -w C96C48_hybatmDA_57107f48.xml -d C96C48_hybatmDA_57107f48.db | grep QUE
202112210000         enkfgdascleanup                    64649768              QUEUED                   -         0           0.0

Terry.McGuinness (hfe03) C96C48_hybatmDA_57107f48 $ rocotocheck -w C96C48_hybatmDA_57107f48.xml -d C96C48_hybatmDA_57107f48.db -c 202112210000 -t enkfgdascleanup

Task: enkfgdascleanup
  account: nems
  command: /scratch1/NCEPDEV/global/CI/2755/gfs/jobs/rocoto/cleanup.sh
  cores: 1
  cycledefs: gdas
  final: false
  jobname: C96C48_hybatmDA_57107f48_enkfgdascleanup_00
  join: /scratch1/NCEPDEV/global/CI/2755/RUNTESTS/COMROOT/C96C48_hybatmDA_57107f48/logs/2021122100/enkfgdascleanup.log
  maxtries: 2
  memory: 4096M
  name: enkfgdascleanup
  nodes: 1:ppn=1:tpp=1
  partition: hera
  queue: batch
  throttle: 9999999
  walltime: 00:15:00
  environment
    CDATE ==> 2021122100
    COMROOT ==> /scratch1/NCEPDEV/global/CI/2755/RUNTESTS/COMROOT
    DATAROOT ==> /scratch1/NCEPDEV/stmp2/Terry.McGuinness/RUNDIRS/C96C48_hybatmDA_57107f48/enkfgdas.2021122100
    EXPDIR ==> /scratch1/NCEPDEV/global/CI/2755/RUNTESTS/EXPDIR/C96C48_hybatmDA_57107f48
    HOMEgfs ==> /scratch1/NCEPDEV/global/CI/2755/gfs
    NET ==> gfs
    PDY ==> 20211221
    RUN ==> enkfgdas
    RUN_ENVIR ==> emc
    cyc ==> 00
  dependencies
    AND is satisfied
      SOME is satisfied
        enkfgdasearc00 of cycle 202112210000 is SUCCEEDED
        enkfgdasearc01 of cycle 202112210000 is SUCCEEDED

Cycle: 202112210000
  Valid for this task: YES
  State: active
  Activated: 2024-08-08 00:59:14 UTC
  Completed: -
  Expired: -

Job: 64649768
  State:  QUEUED (PENDING)
  Exit Status:  -
  Tries:  0
  Unknown count:  0
  Duration:  0.0
Terry.McGuinness (hfe03) C96C48_hybatmDA_57107f48 $
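
One way to spot jobs like these from a script is to filter on the Slurm reason field. The sketch below is an assumption about how such a check could look, not part of the CI scripts: list_held_jobs is a hypothetical helper, and the third sample row is made up so the filter has something to exclude.

```bash
#!/usr/bin/env bash
# list_held_jobs: read "JOBID|STATE|REASON" lines on stdin and print the IDs of
# pending jobs whose reason mentions "held".
list_held_jobs() {
  awk -F'|' '$2 == "PENDING" && $3 ~ /held/ { print $1 }'
}

# On a Slurm system the input could come from something like:
#   squeue -h -u "${USER}" -o "%i|%T|%r"
# Here the two jobs from the squeue output above, plus one hypothetical job,
# stand in for live output.
sample='64649768|PENDING|launch failed requeued held
64649463|PENDING|launch failed requeued held
64650000|PENDING|Priority'

held=$(printf '%s\n' "${sample}" | list_held_jobs)
echo "${held}"
```

A job listed this way stays QUEUED from Rocoto's point of view, so a stall check that looks only at Rocoto states would not notice it.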

DavidHuber-NOAA · 2024-08-12T13:41:28Z

@TerrenceMcGuinness-NOAA This looks like an issue with Slurm based on this conversation. Here is an excerpt from Slurm support:

this usually means the launch failed on the compute nodes for one reason or another and instead of relaunching the job on the node and draining the queue it holds it to gain further review from the user or the admin.

What I would do is look at the slurmd log from one of the nodes where the job ran and see why the job failed.

A common mistake is the spool dir isn't owned by user root (or the slurmd isn't run by user root).  But I am guessing this is a new thing and most other jobs have run with no issue.

If you could look at the log and see if you can find something let me know.

If you can't easily see anything please attach the slurmd and slurmctld logs and I can look and see if I can find something.

If you can also attach your slurm.conf file that would be helpful as well.

I'd suggest you report it to RDHPCS.

@DavidHuber-NOAA I will return to this shortly. As far as I can tell (no log and no retries), this job never ran. I still need to run Slurm queries on the Slurm job number.

TerrenceMcGuinness-NOAA · 2024-08-12T14:25:56Z

Made the suggested updates to the Rocoto configuration for timeouts and restarted all the CI cases for this PR on WCOSS2:

terry.mcguinness (clogin03) 1.3.5 $ cat rocotorc 
---
:DatabaseType: SQLite3
:WorkflowDocType: XML
:DatabaseServer: true
:BatchQueueServer: true
:WorkflowIOServer: true
:MaxUnknowns: 3
:MaxLogDays: 7
:AutoVacuum: true
:VacuumPurgeDays: 30
:SubmitThreads: 8
:JobQueueTimeout: 120
:JobAcctTimeout: 120

terry.mcguinness (clogin03) 1.3.5 $ crontab -l
MAILTO="[email protected]>"
SHELL=/bin/bash -l
ci_dir=/lfs/h2/emc/global/noscrub/terry.mcguinness/GW/global-workflow_ci/ci/scripts
d_cmd=date +%Y-%m-%d-%H:%M
*/4 * * * * d=$($d_cmd); ${ci_dir}/driver.sh >& ~/ci_logs/bash_driver_${d}.log || echo "ERROR in driver"
*/7 * * * * d=$($d_cmd); ${ci_dir}/run_ci.sh >& ~/ci_logs/run_ci_${d}.log || echo "ERROR in run_ci"
*/6 * * * * d=$($d_cmd); ${ci_dir}/check_ci.sh >& ~/ci_logs/check_ci_${d}.log || echo "ERROR in check_ci"

terry.mcguinness (clogin03) 1.3.5 $ q

cbqs01: 
                                                                 Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
148101461.cbqs01     terry.m* dev      C48_S2SW_* 129237   1 128    --  03:00 R 00:04
148101541.cbqs01     terry.m* dev      C96_atmae* 172194   1 128    --  00:20 R 00:02
148101616.cbqs01     terry.m* dev      C96C48_hy* 117488   4 500    --  01:20 R 00:02
148101617.cbqs01     terry.m* dev      C96C48_hy* 187580   1  80    --  00:15 R 00:02
148101618.cbqs01     terry.m* dev      C96C48_hy*  70962   4 500    --  01:00 R 00:02
148101695.cbqs01     terry.m* dev      C96C48_uf*  67337   1 128    --  00:20 R 00:01
148101706.cbqs01     terry.m* dev      C96C48_uf*  73148   1 128    --  00:20 R 00:01
148101707.cbqs01     terry.m* dev      C96C48_uf* 212313   1 128    --  00:20 R 00:01

terry.mcguinness (clogin03) 1.3.5 $ displaydb
2755 Open Running 0 ci_repo
terry.mcguinness (clogin03) 1.3.5 $

emcbot · 2024-08-12T14:30:22Z

Experiment C96_atm3DVar_extended_d443bf9c STALLED on Wcoss2 at 08/12/24 02:30:21 PM

TerrenceMcGuinness-NOAA · 2024-08-12T14:38:49Z

Confirmed: the STALLED condition on WCOSS2 is a false negative. No UNAVAILABLE states were observed, only UNKNOWN ones. Added tasking to make the STALLED flag more robust.

TerrenceMcGuinness-NOAA · 2024-08-12T15:48:36Z

Got past the PBS/Rocoto anomalies that were leading to false negatives for STALLED because of UNKNOWN/UNAVAILABLE states, and arrived at gsi.x failures. Starting CI up one more time to capture the results.
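
A stall check that tolerates these transient states might look roughly like the sketch below. It is an assumption about one possible approach, not the logic in the CI scripts: has_active_tasks is a hypothetical helper, the choice to count UNKNOWN and UNAVAILABLE as "still active" is this sketch's own, and the sample rows are invented in the column layout of the rocotostat output quoted earlier.

```bash
#!/usr/bin/env bash
# has_active_tasks: read rocotostat-style rows on stdin (state in column 4) and
# succeed if any task is in a state that may still make progress.
# Counting UNKNOWN and UNAVAILABLE as active is an assumption of this sketch.
has_active_tasks() {
  awk '$4 ~ /^(QUEUED|RUNNING|SUBMITTING|UNKNOWN|UNAVAILABLE)$/ { found = 1 }
       END { exit (found ? 0 : 1) }'
}

# Hypothetical rows: cycle, task, job ID, state, exit status, tries, duration.
rows='202112210000 gdasfcst 148101616 UNKNOWN - 0 0.0
202112210000 gdasprep 148101617 SUCCEEDED 0 1 95.0'

if printf '%s\n' "${rows}" | has_active_tasks; then
  verdict="not stalled"
else
  verdict="stalled"
fi
echo "${verdict}"
```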

emcbot · 2024-08-12T15:54:24Z

Experiment C48_S2SW_d443bf9c FAIL on Wcoss2 at 08/12/24 03:54:24 PM

Error logs:

/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2755/RUNTESTS/COMROOT/C48_S2SW_d443bf9c/logs/2021032312/gfsfcst.log

Follow link here to view the contents of the above file(s): (link)

KateFriedman-NOAA · 2024-08-12T18:37:46Z

@EricSinsky-NOAA FYI, I have a PR in CI testing that refactors the staging job: #2651

I ended up removing the DTG_PREFIX that is built using cycle and OFFSET_START_HOUR and replaced with model_start_date_current_cycle and associated logic to set that variable. I think my PR may add conflicts with yours if it goes in first. If you have a moment, please review the staging job PR and let me know if we'll need to add the usage of OFFSET_START_HOUR back in and update the REPLAY_ICS blocks in the new yaml (see links below). I was not able to test with REPLAY_ICS=YES for my branch testing.

https://github.com/KateFriedman-NOAA/global-workflow/blob/feature/issue_2475/parm/stage/stage.yaml.j2#L104
https://github.com/KateFriedman-NOAA/global-workflow/blob/feature/issue_2475/parm/stage/stage.yaml.j2#L166

If my PR goes in first, I can work with you to make any needed updates in your branch to accommodate the staging job refactor. Let me know!

emcbot · 2024-08-12T18:57:44Z

CI Passed Hera at
Built and ran in directory /scratch1/NCEPDEV/global/CI/2755


Experiment C48_ATM_57107f48 Completed 1 Cycles: *SUCCESS* at Thu Aug  8 02:19:58 UTC 2024
Experiment C48mx500_3DVarAOWCDA_57107f48 Completed 2 Cycles: *SUCCESS* at Thu Aug  8 02:27:37 UTC 2024
Experiment C96_atm3DVar_57107f48 Completed 3 Cycles: *SUCCESS* at Thu Aug  8 03:41:58 UTC 2024
Experiment C48_S2SWA_gefs_57107f48 Completed 1 Cycles: *SUCCESS* at Thu Aug  8 03:48:04 UTC 2024
Experiment C48_S2SW_57107f48 Completed 1 Cycles: *SUCCESS* at Thu Aug  8 04:06:26 UTC 2024
Experiment C96C48_hybatmDA_57107f48 Completed 3 Cycles: *SUCCESS* at Mon Aug 12 17:14:12 UTC 2024
Experiment C96_atmaerosnowDA_57107f48 Completed 3 Cycles: *SUCCESS* at Mon Aug 12 18:57:35 UTC 2024

WalterKolczynski-NOAA · 2024-08-13T03:10:12Z

Since (a) the previous failures on WCOSS were unrelated to this PR and (b) everything changed is covered by tests run on other machines, I'm going to go ahead and merge this.

FYI: @KateFriedman-NOAA

…e_rocoto * origin/develop: Jenkins Pipeline Updates (NOAA-EMC#2815) Add Gaea C5 to CI (NOAA-EMC#2814) Add support for forecast-only runs on AWS (NOAA-EMC#2711) Add fixes to products for when REPLAY IC's are used (NOAA-EMC#2755) Add capability to run forecast in segments (NOAA-EMC#2795)

EricSinsky-NOAA · 2024-08-14T12:48:35Z

@KateFriedman-NOAA Thank you for letting us know about these conflicts with the replay-related variables. We can continue working to resolve these conflicts in @NeilBarton-NOAA's PR #2788.

EricSinsky-NOAA and others added 11 commits June 3, 2024 15:16

Update linters.yaml

9e6206c

The linters.yaml has been updated so that shellnorms can run whenever a push occurs to EricSinsky-NOAA/develop.

Revert changes in linters.yaml

4e13fbe

The changes to linters.yaml have been reverted.

Merge branch 'NOAA-EMC:develop' into develop

1eb0c26

Merge branch 'NOAA-EMC:develop' into develop

720276b

Merge branch 'NOAA-EMC:develop' into develop

219adb2

Merge branch 'NOAA-EMC:develop' into develop

37e2a6f

Merge branch 'NOAA-EMC:develop' into develop

b7ba980

Merge branch 'NOAA-EMC:develop' into develop

a3efe66

Merge branch 'NOAA-EMC:develop' into develop

12e1ec1

Add changes related to offset

557ac3f

Changes have been added to account for the fixes needed to be made when there is an offset. These changes involve adjusting the atmosphere and ocean fhrs.

Merge remote-tracking branch 'EMC/develop' into feature/offsetfixes

6c2b7da

This was referenced Jul 10, 2024

Ocean_prod task not being triggered when REPLAY_ICS is set to YES #2725

Closed

Atmos data for first fhr not being produced when REPLAY_ICS is set to YES #2754

Closed

aerorahul reviewed Jul 11, 2024

ush/parsing_namelists_FV3.sh (outdated, resolved)

ush/parsing_namelists_FV3.sh (outdated, resolved)

ush/forecast_postdet.sh (outdated, resolved)

ush/forecast_postdet.sh (outdated, resolved)

EricSinsky-NOAA added 6 commits July 11, 2024 19:43

Merge remote-tracking branch 'EMC/develop' into feature/offsetfixes

15de909

Replace NDATE with date

c1ddb4a

NDATE has been removed and has been replaced with date.
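
For readers unfamiliar with the pattern, a date-based equivalent of an NDATE call could look like the sketch below. It assumes GNU date and uses hypothetical values; the variable names current_cycle and model_start are illustrative, and the actual lines changed in this commit are not reproduced here.

```bash
#!/usr/bin/env bash
# Hypothetical inputs: a 10-digit cycle (YYYYMMDDHH) and an offset in hours.
current_cycle=2021122100
OFFSET_START_HOUR=3

# GNU date accepts "YYYYMMDD HH + N hours"; --utc avoids local time zone shifts.
model_start=$(date --utc \
  -d "${current_cycle:0:8} ${current_cycle:8:2} + ${OFFSET_START_HOUR} hours" \
  +%Y%m%d%H)

echo "${model_start}"
```

Using date removes the dependency on the external NDATE utility, at the cost of relying on GNU-specific date parsing.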

Merge remote-tracking branch 'EMC/develop' into feature/offsetfixes

c1ffe30

Move where replay diag_table is defined

0dba84e

The determination on whether or not the replay_diag_table should be used has been moved to config.fcst.

Move dest_file outside if-statement

fb440c6

The dest_file variable has been moved outside of if-statement in forecast_postdet.

Merge remote-tracking branch 'EMC/develop' into feature/offsetfixes

5f393b1

EricSinsky-NOAA mentioned this pull request Jul 18, 2024

Remove f000 from atmos rocoto tasks for replay cases #2778

Merged

7 tasks

Add bugfix in exglobal_stage_ic

dcb8066

A bugfix has been added in exglobal_stage_ic to ensure ${MEMDIR:3} is not interpreted in base 8.
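
The base-8 problem comes from Bash treating numbers with a leading zero as octal inside arithmetic. A minimal sketch of the usual workaround follows; it is not the code from this commit, and the value mem008 and the names member and member_num are assumptions made for illustration.

```bash
#!/usr/bin/env bash
MEMDIR="mem008"          # hypothetical member directory name
member="${MEMDIR:3}"     # "008": the characters after the "mem" prefix

# $(( member )) would fail here: a leading zero makes Bash read the number as
# octal, and 8 is not a valid octal digit. The 10# prefix forces base 10.
member_num=$(( 10#${member} ))

echo "${member_num}"
```

Members 001 through 007 happen to evaluate correctly even without the prefix, so this kind of bug tends to appear only from member 008 onward.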

EricSinsky-NOAA mentioned this pull request Jul 23, 2024

Product generation for GEFSv13 reforecast #1878

Closed

EricSinsky-NOAA added 2 commits July 24, 2024 13:08

Revert "Add bugfix in exglobal_stage_ic"

2936472

This reverts commit dcb8066.

Merge remote-tracking branch 'EMC/develop' into feature/offsetfixes

8cf65f5

NeilBarton-NOAA mentioned this pull request Jul 24, 2024

Adding a CI for using Replay ICs with an offset hour of 3 #2788

Merged

7 tasks

Add fix for fv3 filename when fhr is decimal

77b00cf

A fix has been added for cases where there is an fhr that is a decimal number.

github-advanced-security bot found potential problems Jul 25, 2024

ush/forecast_postdet.sh (fixed)

EricSinsky-NOAA added 2 commits July 25, 2024 16:28

Address shell check error

d1efec8

Add more fixes for filename when fhr is decimal

78c4b59

DavidHuber-NOAA mentioned this pull request Aug 9, 2024

Hotfix: Handle UNAVAILABLE rocoto status in Bash CI #2820

Merged

5 tasks

TerrenceMcGuinness-NOAA added the CI-Wcoss2-Running label and removed the CI-Wcoss2-Failed label Aug 12, 2024

emcbot added the CI-Wcoss2-Failed label and removed the CI-Wcoss2-Running label Aug 12, 2024

TerrenceMcGuinness-NOAA added the CI-Wcoss2-Running label and removed the CI-Wcoss2-Failed label Aug 12, 2024

emcbot added the CI-Wcoss2-Failed label and removed the CI-Wcoss2-Running label Aug 12, 2024

emcbot added the CI-Hera-Passed label and removed the CI-Hera-Running label Aug 12, 2024

WalterKolczynski-NOAA removed the CI-Wcoss2-Failed label Aug 12, 2024

Merge branch 'develop' into feature/offsetfixes

aaab44a

WalterKolczynski-NOAA approved these changes Aug 13, 2024

WalterKolczynski-NOAA merged commit 5699167 into NOAA-EMC:develop Aug 13, 2024
5 checks passed

EricSinsky-NOAA deleted the feature/offsetfixes branch August 14, 2024 12:27
