Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update WRF_Hydro MPI calls. #552

Draft
wants to merge 15 commits into
base: main
Choose a base branch
from
Draft

Update WRF_Hydro MPI calls. #552

wants to merge 15 commits into from

Conversation

donaldwj
Copy link
Collaborator

@donaldwj donaldwj commented Apr 29, 2021

TYPE: enhancement

KEYWORDS: MPI, Communication, Enhancement

SOURCE: Donald W Johnson, NWC (National Water Center)

DESCRIPTION OF CHANGES: This pull request is intended to replace mpi code in mpp_land.F and other mpp files where mpi_send and mpi_recv are being used for tasks better suited to higher level mpi constructs (mpi_gather[v], mpi_scatter[v], mpi_reduce, mpi_allreduce). The intent is to not change the current behavior of any functions but to improve the efficiency of the underling communication. In some cases this means that internal unit conversions (for example unneeded changes form real*8 to real) will be eliminated and this may slightly change final answers.

ISSUE:

Fixes #551

TESTS CONDUCTED:

Initial Testing was conducted with the croton domain on Cheyenne. Although all test passed on this domain I suspect regression may fail on conus because the updated functions had unnecessary data type changes removed.

####################################

TESTING: --- nwm_ana ---

####################################

echo; echo "Using the following modules for testing:" ; module list; echo;pytest -vv --tb=no --ignore=local -p no:cacheprovider --html=/glade/scratch/donaldj/take_test_croton_NY_mpi_updates/wrfhydro_testing-ifort-nwm_ana.html --self-contained-html --config=nwm_ana --compiler=ifort --domain_dir=/glade/work/jamesmcc/domains/public/croton_NY --candidate_dir=/glade/scratch/donaldj/take_test_croton_NY_mpi_updates/dwj_wrf_hydro_nwm_can_pytest/trunk/NDHMS --reference_dir=/glade/scratch/donaldj/take_test_croton_NY_mpi_updates/reference_ref_pytest/trunk/NDHMS --output_dir=/glade/scratch/donaldj/take_test_croton_NY_mpi_updates --exe_cmd='mpirun -np 6 ./wrf_hydro.exe' --ncores=6 --xrcmp_n_cores=0 --scheduler --nnodes=1 --account=NRAL0017 --walltime=01:00:00 --queue=share

Using the following modules for testing:

Currently Loaded Modules:

  1. intel/18.0.5 2) impi/2018.4.274 3) ncarcompilers/0.5.0 4) netcdf/4.7.4 5) ncarenv/1.3 6) nccmp/1.8.2.1

=========================================================== test session starts ===========================================================
platform linux -- Python 3.6.8, pytest-5.1.2, py-1.8.0, pluggy-0.12.0 -- /glade/p/cisl/nwc/model_testing_env/wrf_hydro_nwm_test/bin/python
metadata: {'Python': '3.6.8', 'Platform': 'Linux-4.12.14-95.51-default-x86_64-with-SuSE-12-x86_64', 'Packages': {'pytest': '5.1.2', 'py': '1.8.0', 'pluggy': '0.12.0'}, 'Plugins': {'html': '1.19.0', 'datadir-ng': '1.1.0', 'metadata': '1.8.0'}, 'JAVA_HOME': '/usr/lib64/jvm/java'}
rootdir: /glade/scratch/donaldj/take_test_croton_NY_mpi_updates/dwj_wrf_hydro_nwm_can_pytest
plugins: html-1.19.0, datadir-ng-1.1.0, metadata-1.8.0
collected 16 items

tests/test_1_fundamental.py::test_compile_candidate PASSED [ 6%]
tests/test_1_fundamental.py::test_compile_reference PASSED [ 12%]
tests/test_1_fundamental.py::test_run_candidate PASSED [ 18%]
tests/test_1_fundamental.py::test_run_reference PASSED [ 25%]
tests/test_1_fundamental.py::test_ncores_candidate PASSED [ 31%]
tests/test_1_fundamental.py::test_perfrestart_candidate PASSED [ 37%]
tests/test_2_regression.py::test_regression_data PASSED [ 43%]
tests/test_2_regression.py::test_regression_metadata PASSED [ 50%]
tests/test_3_outputs.py::test_output_has_nans PASSED [ 56%]
tests/test_supp_1_channel_only.py::test_run_candidate_channel_only PASSED [ 62%]
tests/test_supp_1_channel_only.py::test_channel_only_matches_full PASSED [ 68%]
tests/test_supp_1_channel_only.py::test_ncores_candidate_channel_only PASSED [ 75%]
tests/test_supp_1_channel_only.py::test_perfrestart_candidate_channel_only PASSED [ 81%]
tests/test_supp_2_nwm_output.py::test_run_reference_nwm_output_sim PASSED [ 87%]
tests/test_supp_2_nwm_output.py::test_run_candidate_nwm_output_sim PASSED [ 93%]
tests/test_supp_2_nwm_output.py::test_regression_metadata_nwm_output PASSED [100%]

NOTES:

This pull request will be updated periodically. I intend to only change one or two functions per commit. I am currently keeping the original function definitions in comments below changes, should this be maintained?

Checklist

Merging the PR depends on following checklist being completed. Add X between each of the square
brackets if they are completed in the PR itself. If a bullet is not relevant to you, please comment
on why below the bullet.

@donaldwj
Copy link
Collaborator Author

Note: more commits with other function updates will follow.

@rcabell rcabell force-pushed the master branch 3 times, most recently from 93ccb4d to 385b797 Compare July 8, 2021 01:10
@donaldwj donaldwj changed the title Update definitions of sum_real8 and sumreal1d. Update WRF_Hydro MPI calls. Aug 9, 2021
…the size and displacement vectors needed by scatterv and gatherv.
@scrasmussen
Copy link
Member

This looks good and works well, thanks @donaldwj!!
I compared the results of this PR with main using Croton and got identical results.

@rcabell, I believe you had mentioned that it is a preference to simply delete the old code instead of commenting out sections, do we want that done with lines like 408-454, 647-663 of this PR?

@donaldwj
Copy link
Collaborator Author

donaldwj commented Aug 16, 2021 via email

@scrasmussen
Copy link
Member

I've looked over, built, and ran on the Croton test domain. Everything built and ran successfully so I'm postulating that everything is working correctly. I've added to my todo list to make some nice unit tests for these changes but that also involves creating the structure for unit tests so I recommend accepting this PR in the meantime. All looking good : )

scrasmussen
scrasmussen previously approved these changes Aug 20, 2021
Copy link
Member

@scrasmussen scrasmussen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good and builds correctly. I tested on the Croton domain and am getting the same results pre and post PR. All recommendations have been followed and everything seems good! These changes will help performance 👍

@donaldwj
Copy link
Collaborator Author

Is there a simple way to get the testing framework to do CONUS runs. In particular I want to see if there is noticeable speed change.

@rcabell
Copy link
Collaborator

rcabell commented Aug 23, 2021

Is there a simple way to get the testing framework to do CONUS runs. In particular I want to see if there is noticeable speed change.

We have a mirror system of the GitHub CI tests on Cheyenne that does 48 hour Full and Long-Range Physics tests. I just started it and can have timing data for you soon.

@rcabell
Copy link
Collaborator

rcabell commented Sep 14, 2021

Something strange is going on with this update when run over CONUS, probably a race condition or something. If I re-run the tests several times, usually they hang on the first run_candidate stage. But other times (1 in 3 or 4) the testing will continue to another subsequent run of the model, where it will hang. I managed to get Long-Range to go to completion, but not full-physics. Croton always appears to succeed, regardless of compiler or MPI library. We may need to find an intermediate-complexity domain to figure out what's going on.

@donaldwj
Copy link
Collaborator Author

donaldwj commented Sep 14, 2021 via email

@rcabell rcabell marked this pull request as draft April 13, 2022 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants