cam6_4_052: clubb_intr GPUization #1175
…e options from running when compiled with OpenACC
… by breaking BFBness. All these values are ~0 but not always exactly.
…t and BFB on CPUs.
Looks good to me! I had some questions and change requests but none of them are required, and of course if you have any concerns with any of my requests then just let me know. Thanks!
… in clubb_ini_cam. Reusing inv_exner_clubb(:,pver) to set inv_exner_clubb_surf. Splitting up array initialization loop to slightly improve CPU performance.
Good finds, I took you up on every suggestion.
I ran a PFS test to check how the performance changed, and initially found that clubb_tend_cam was ~8% slower with these changes (up to cfd2824), which is unacceptable. I guessed that the biggest performance killer was a massive loop I wrote to replace a large amount of vector notation, where the code just zeros out a bunch of arrays. That seems to have been the culprit: I split the loop up here, around line 2991, and the result was slightly faster than the original code. Here's the timing output comparison now. I ran these a couple of times and got roughly the same results.
From the cam_dev head I started with (cam6_4_038)
From the head of this branch (75f51d1)
So it seems these changes made clubb_tend_cam slightly faster. I ran all the same tests mentioned above to confirm these new changes.
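The loop split described above can be sketched roughly as follows. This is only an illustration of the two shapes, not the code in clubb_intr.F90: the module name, the routine names (zero_fused, zero_split), the local r8 kind and the three arrays are all hypothetical stand-ins for the much larger set of arrays being zeroed. One plausible reason the split form does better on CPUs is that each nest writes a single contiguous stream, which compilers tend to vectorize well, but that is an assumption and would need to be confirmed by timing.

```fortran
module zero_arrays_sketch
  ! Hypothetical sketch; names and array set are not from clubb_intr.F90.
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
contains

  ! Fused form: a single loop nest zeroes every array, so each
  ! iteration writes to several different arrays.
  subroutine zero_fused(ncol, nlev, a, b, c)
    integer,  intent(in)  :: ncol, nlev
    real(r8), intent(out) :: a(ncol,nlev), b(ncol,nlev), c(ncol,nlev)
    integer :: i, k
    do k = 1, nlev
      do i = 1, ncol
        a(i,k) = 0._r8
        b(i,k) = 0._r8
        c(i,k) = 0._r8
      end do
    end do
  end subroutine zero_fused

  ! Split form: one loop nest per array, so each nest writes a
  ! single contiguous block of memory.
  subroutine zero_split(ncol, nlev, a, b, c)
    integer,  intent(in)  :: ncol, nlev
    real(r8), intent(out) :: a(ncol,nlev), b(ncol,nlev), c(ncol,nlev)
    integer :: i, k
    do k = 1, nlev
      do i = 1, ncol
        a(i,k) = 0._r8
      end do
    end do
    do k = 1, nlev
      do i = 1, ncol
        b(i,k) = 0._r8
      end do
    end do
    do k = 1, nlev
      do i = 1, ncol
        c(i,k) = 0._r8
      end do
    end do
  end subroutine zero_split

end module zero_arrays_sketch
```

Both routines leave the arrays in the same state, so swapping one form for the other should not affect BFB results; only the timing differs.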
This is a really nice cleanup and the GPU code is not too obtrusive. I give it five stars!
My comments are mostly questions so I can understand what is going on as best as possible.
I am only about 96% confident that the vertical dimension changes are correct through the whole file. I think the regression tests will be necessary to make sure we have that right in every case.
…st the function call.
Since I was curious, I took a cursory look at this PR. I did find one minor item which I believe would be good to add (since it took me a while to find the answer). This is not intended to be a full review.
Hi @huebleruwm, apologies for the delay (we had some CESM-wide PRs that had to come in first), but this PR is finally close to the top of our queue and thus we are about to run our full regression testing suite. Are there any remaining modifications you wanted to make before we test and merge, or is this PR ready to go on your end?
@sjsprecious In case you didn't see this I just wanted to let you know that this PR has now been merged into
Thanks @nusbaume. Since Gunther has verified the GPU results through ECT, everything looks good to me so far.
This only modifies clubb_intr.F90 and doesn't require a new version of clubb. The purpose of this is the addition of acc directives, added in order to offload computations to GPUs. Besides the directives, this mainly consists of replacing vector notation with explicit loops, combining loops with the same bounds where possible, and moving non-GPUized function calls outside of the GPU section. I also added some new notation for the number of vertical levels (nzm_clubb and nzt_clubb) that improves readability and will make it easier to merge in future versions of clubb. I also included some timing statements, similar to the ones added in the Earthworks ew-develop branch, which this version of clubb_intr is also compatible with.
This is BFB on CPUs (tested with intel), runs with intel+debugging, and passes the ECT test when comparing CPU results to GPU results (using cam7). There are some options that I didn't GPUize or test (do_clubb_mf, do_rainturb, do_cldcool, clubb_do_icesuper, single_column), so I left the code for them untouched and added some checks to stop the run if they're set when the code is compiled with OpenACC.
If there ends up being something wrong with these changes, then this version, which is an earlier commit that contains only a new OpenACC data statement and some timer additions, would be nice to get in at least.
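For readers unfamiliar with the pattern, here is a minimal sketch of the two ideas in the description: turning a vector-notation assignment into an explicit loop nest with an OpenACC directive, and refusing to run unported options in an OpenACC build. It is not taken from clubb_intr.F90. The module name, scale_field, blocked_in_acc_build, the local r8 kind and the acc_build argument are all hypothetical. In a preprocessed .F90 file the build flag would more typically be set under `#ifdef _OPENACC`, and CAM code would normally abort through its own abort routine; here the flag is passed in so the sketch compiles without a preprocessor and the caller decides how to stop.

```fortran
module clubb_acc_sketch
  ! Hypothetical sketch; names are not from clubb_intr.F90.
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
contains

  ! Vector-notation form being replaced (for comparison):
  !   out(:ncol,:) = a(:ncol,:) * b(:ncol,:)
  ! Explicit loop nest that an OpenACC compiler can offload.
  ! Without OpenACC the directive is just a comment.
  subroutine scale_field(ncol, nlev, a, b, out)
    integer,  intent(in)  :: ncol, nlev
    real(r8), intent(in)  :: a(ncol,nlev), b(ncol,nlev)
    real(r8), intent(out) :: out(ncol,nlev)
    integer :: i, k

    !$acc parallel loop gang vector collapse(2) copyin(a, b) copyout(out)
    do k = 1, nlev
      do i = 1, ncol
        out(i,k) = a(i,k) * b(i,k)
      end do
    end do
  end subroutine scale_field

  ! Returns .true. when this is an OpenACC build (acc_build) and at
  ! least one of the options that was not ported to GPUs is enabled.
  logical function blocked_in_acc_build(acc_build, do_clubb_mf, do_rainturb, &
                                        do_cldcool, clubb_do_icesuper,       &
                                        single_column)
    logical, intent(in) :: acc_build
    logical, intent(in) :: do_clubb_mf, do_rainturb, do_cldcool
    logical, intent(in) :: clubb_do_icesuper, single_column

    blocked_in_acc_build = acc_build .and. &
         (do_clubb_mf .or. do_rainturb .or. do_cldcool .or. &
          clubb_do_icesuper .or. single_column)
  end function blocked_in_acc_build

end module clubb_acc_sketch
```

A caller could evaluate blocked_in_acc_build once during initialization and abort with a descriptive message when it returns .true., so that an unsupported configuration fails early rather than producing untested GPU results.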