cam6_4_052: clubb_intr GPUization #1175

Merged
merged 20 commits into from
Jan 6, 2025


huebleruwm

This only modifies clubb_intr.F90 and doesn't require a new version of clubb. The purpose is to add OpenACC directives so that computations can be offloaded to GPUs. Besides the directives, the changes mainly consist of replacing vector notation with explicit loops, combining loops with the same bounds where possible, and moving non-GPUized function calls outside of the GPU section. I also added some new notation for the number of vertical levels (nzm_clubb and nzt_clubb) that improves readability and will make it easier to merge in future versions of clubb. I also included some timing statements, similar to the ones added in the Earthworks ew-develop branch, which this version of clubb_intr is also compatible with.

This is BFB on CPUs (tested with intel), runs with intel+debugging, and passes the ECT test when comparing CPU results to GPU results (using cam7). There are some options that I didn't GPUize or test (do_clubb_mf, do_rainturb, do_cldcool, clubb_do_icesuper, single_column), so I left the code for them untouched and added some checks to stop the run if they're set when the code is compiled with OpenACC.

If there ends up being something wrong with these changes, then it would be nice to at least get in this version, which is an earlier commit that contains only a new OpenACC data statement and some timer additions.

@Katetc Katetc self-requested a review October 21, 2024 21:43
@cacraigucar cacraigucar requested a review from nusbaume October 22, 2024 15:10
Collaborator

@nusbaume nusbaume left a comment


Looks good to me! I had some questions and change requests but none of them are required, and of course if you have any concerns with any of my requests then just let me know. Thanks!

… in clubb_ini_cam. Reusing inv_exner_clubb(:,pver) to set inv_exner_clubb_surf. Splitting up array initialization loop to slightly improve CPU performance.
@huebleruwm
Author

> Looks good to me! I had some questions and change requests but none of them are required, and of course if you have any concerns with any of my requests then just let me know. Thanks!

Good finds, I took you up on every suggestion.

@huebleruwm
Author

huebleruwm commented Oct 28, 2024

I ran a PFS test to check how the performance changed, and initially found that clubb_tend_cam was ~8% slower with these changes (up to cfd2824), which is unacceptable. I guessed that the biggest performance killer was a massive loop I had written to replace a large amount of vector notation, where the code just zeros out a bunch of arrays. That seems to have been the culprit: after I split the loop up here around line 2991, the result was slightly faster than the original code.

Here's the timing output comparison now. I ran these a couple of times and got roughly the same results.

From the cam_dev head I started with (cam6_4_038)

Region                                         PETs   PEs    Count    Mean (s)    Min (s)     Min PET Max (s)     Max PET
macrop_tend                                    512    512    MULTIPLE 135.3464    124.0819    366     146.0524    42     
  clubb_tend_cam                               512    512    MULTIPLE 132.0139    121.0235    366     142.8016    166    
    clubb_tend_cam_i_loop                      512    512    MULTIPLE 111.1615    102.4363    366     123.0577    166    

From the head of this branch (75f51d1)

Region                                         PETs   PEs    Count    Mean (s)    Min (s)     Min PET Max (s)     Max PET
macrop_tend                                    512    512    MULTIPLE 132.7466    123.5249    366     144.4218    149    
  clubb_tend_cam                               512    512    MULTIPLE 129.4563    120.4303    366     140.8711    149    
    clubb_tend_cam:ACCR                        512    512    MULTIPLE 107.8117    100.6424    366     117.3765    149    
      clubb_tend_cam:advance_clubb_core_api    512    512    MULTIPLE 91.2902     84.0197     366     100.7022    149    
      clubb_tend_cam:flip-index                512    512    MULTIPLE 11.2684     9.9628      220     13.7559     455    
   clubb_tend_cam:NAR                          512    512    MULTIPLE 21.4333     19.0054     401     24.0401     76     
                      qneg3                    512    512    MULTIPLE 0.5783      0.5268      474     0.6311      66     
   clubb_tend_cam:acc_copyin                   512    512    MULTIPLE 0.0059      0.0045      358     0.0081      260    
   clubb_tend_cam:acc_copyout                  512    512    MULTIPLE 0.0038      0.0025      314     0.0061      149    

So it seems these changes made clubb_tend_cam faster by ~2%.

I ran all the same tests mentioned above to confirm these new changes.
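
As a quick sanity check on the ~2% figure, the relative change can be recomputed from the Mean (s) column of the two timing tables above. This is only a minimal sketch: the dictionary names and the `percent_faster` helper are made up for illustration, and the numbers are copied by hand from the tables rather than parsed from the timing files.

```python
# Mean wall-clock times in seconds, copied by hand from the two timing
# tables above (cam6_4_038 baseline vs. branch head 75f51d1).
BASELINE_MEAN_S = {"macrop_tend": 135.3464, "clubb_tend_cam": 132.0139}
BRANCH_MEAN_S = {"macrop_tend": 132.7466, "clubb_tend_cam": 129.4563}


def percent_faster(before_s: float, after_s: float) -> float:
    """Return the runtime reduction as a percentage of the baseline time."""
    return 100.0 * (before_s - after_s) / before_s


for region, before_s in BASELINE_MEAN_S.items():
    change = percent_faster(before_s, BRANCH_MEAN_S[region])
    print(f"{region}: {change:.1f}% faster")
```

Both regions come out at roughly 1.9%, which is consistent with the ~2% quoted above. Since this compares means from single runs, it says nothing about run-to-run variability.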

Collaborator

@Katetc Katetc left a comment


This is a really nice cleanup and the GPU code is not too obtrusive. I give it five stars!

My comments are mostly questions so I can understand what is going on as best as possible.

I am only about 96% confident that the vertical dimension changes are correct through the whole file. I think the regression tests will be necessary to make sure we have that right in every case.

@peverwhee peverwhee changed the title clubb_intr GPUization cam6_4_048: clubb_intr GPUization Nov 11, 2024
@peverwhee peverwhee changed the title cam6_4_048: clubb_intr GPUization cam6_4_049: clubb_intr GPUization Nov 12, 2024
@peverwhee peverwhee changed the title cam6_4_049: clubb_intr GPUization cam6_4_050: clubb_intr GPUization Nov 13, 2024
Collaborator

@cacraigucar cacraigucar left a comment


Since I was curious, I took a cursory look at this PR. I did find one minor item which I believe would be good to add (since it took me a while to find the answer). This is not intended to be a full review.

@nusbaume
Collaborator

Hi @huebleruwm, apologies for the delay (we had some CESM-wide PRs that had to come in first), but this PR is finally close to the top of our queue and thus we are about to run our full regression testing suite. Are there any remaining modifications you wanted to make before we test and merge, or is this PR ready to go on your end?

@cacraigucar cacraigucar changed the title cam6_4_050: clubb_intr GPUization clubb_intr GPUization Dec 31, 2024
@cacraigucar cacraigucar changed the title clubb_intr GPUization cam6_4_051: clubb_intr GPUization Dec 31, 2024
@cacraigucar cacraigucar changed the title cam6_4_051: clubb_intr GPUization clubb_intr GPUization Jan 2, 2025
@cacraigucar cacraigucar changed the title clubb_intr GPUization cam6_4_052: clubb_intr GPUization Jan 3, 2025
@nusbaume nusbaume merged commit ab1e8c7 into ESCOMP:cam_development Jan 6, 2025
2 checks passed
@nusbaume
Collaborator

nusbaume commented Jan 6, 2025

@sjsprecious In case you didn't see this, I just wanted to let you know that this PR has now been merged into cam_development. It did change answers for the GPU regression test, but I believe that was expected. Anyways, I hope that helps!

@sjsprecious
Collaborator

Thanks @nusbaume. Since Gunther has verified the GPU results through ECT, everything looks good to me so far.
