Allow for model scalability beyond 1800 NTASK (bugzilla #1297) #22

Open
DanIredell-NOAA opened this issue Mar 15, 2022 · 3 comments

@DanIredell-NOAA
Collaborator

http://www2.spa.ncep.noaa.gov/bugzilla/show_bug.cgi?id=1297

Currently we are limited to running the forecast job on a maximum of 1800 cores because of CICE's hard-coded NTASK value of 1800.

This hard limit on scalability makes it hard to improve the science (decrease the time step or run faster) and/or fully utilize resources.

For example:
Current reservation line for rtofs_global_forecast_step2:
#PBS -l place=vscatter:excl,select=15:ncpus=120:mpiprocs=120

This uses only 120 of the 128 cores available per node.
The code is not memory bound, so 8 cores per node (120 cores in total across the 15 nodes) are idling in this case.

The following would be more efficient, but would require a different NTASK (15 * 128 = 1920):
#PBS -l place=vscatter:excl,select=15:ncpus=128:mpiprocs=128
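
The arithmetic behind the two reservation lines can be checked with a short sketch. It assumes 128 cores per node, as stated above; the function name `utilization` is made up for illustration and is not part of any RTOFS script.

```python
def utilization(nodes, mpiprocs_per_node, cores_per_node=128):
    """Return (total MPI tasks, idle cores) for an exclusive-node reservation.

    cores_per_node=128 is taken from the issue text; adjust for other systems.
    """
    if mpiprocs_per_node > cores_per_node:
        raise ValueError("more MPI ranks requested than cores per node")
    total_tasks = nodes * mpiprocs_per_node
    idle_cores = nodes * (cores_per_node - mpiprocs_per_node)
    return total_tasks, idle_cores


current = utilization(15, 120)   # select=15:ncpus=120:mpiprocs=120
proposed = utilization(15, 128)  # select=15:ncpus=128:mpiprocs=128
print(current)   # (1800, 120)
print(proposed)  # (1920, 0)
```

The second layout leaves no cores idle, but its 1920 tasks no longer match the NTASK of 1800.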

@DanIredell-NOAA
Collaborator Author

First, this is what we use in operations now (we set exclhost, not excl):
#PBS -l place=vscatter:exclhost,select=15:ncpus=120:mpiprocs=120

Options for the place statement:

Modifier      Meaning
free          Place job on any vnode(s)
pack          All chunks will be taken from one host
scatter       Only one chunk is taken from any host
vscatter      Only one chunk is taken from any vnode.  Each chunk must fit on a vnode.
excl          Only this job uses the vnodes chosen
exclhost      The entire host is allocated to this job
shared        This job can share the vnodes chosen

@DanIredell-NOAA
Collaborator Author

Second, we can create another tile layout for HYCOM that uses more than 1800 tasks. That would require creating another patch.input and changing about half a dozen other parm files (blkdat.input, ice_in). The scripts would also need modifying so that they know which set of these files to use (based on NTASK).

We would also need another hycom executable, because it is compiled with NTASKS set. NTASKS is NPX * NPY, which in the current case is 450 * 4 = 1800. See comp_ice.csh.
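
As a rough illustration of why the reservation line and the compiled layout have to change together, here is a minimal sketch that compares the MPI task count implied by a PBS `select` statement with NPX * NPY. The helper names `tasks_from_select` and `layout_matches` are hypothetical; nothing like them is known to exist in the RTOFS scripts.

```python
import re


def tasks_from_select(pbs_line):
    """Total MPI tasks implied by 'select=<chunks>:ncpus=<n>:mpiprocs=<m>'."""
    m = re.search(r"select=(\d+):ncpus=(\d+):mpiprocs=(\d+)", pbs_line)
    if m is None:
        raise ValueError("no select=...:ncpus=...:mpiprocs=... found")
    chunks, _ncpus, mpiprocs = (int(g) for g in m.groups())
    return chunks * mpiprocs


def layout_matches(pbs_line, npx, npy):
    """True if the reservation provides exactly NPX * NPY MPI tasks."""
    return tasks_from_select(pbs_line) == npx * npy


current = "#PBS -l place=vscatter:exclhost,select=15:ncpus=120:mpiprocs=120"
proposed = "#PBS -l place=vscatter:excl,select=15:ncpus=128:mpiprocs=128"

print(layout_matches(current, 450, 4))   # True:  15 * 120 == 450 * 4
print(layout_matches(proposed, 450, 4))  # False: 15 * 128 == 1920
```

A check along these lines could catch a mismatch between the job card and the compiled executable before submission; which NPX and NPY values would suit a 1920-task layout is not decided here.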

@DanIredell-NOAA
Collaborator Author

DanIredell-NOAA commented Dec 15, 2023

At the V2.4.0 kickoff meeting it was determined that this would be put on hold until the MOM-CICE version planned for RTOFS v3.0 in 2026.
