Derecho CSLAM performance issue. #827
Replies: 12 comments 25 replies
-
@PeterHjortLauritzen - any ideas?
-
Actually, I'm not sure that this is the root cause: my second run with this mod hung as well.
-
Adding the environment variable MPICH_COLL_SYNC=1 seems to have solved this problem and does not seem to adversely affect performance.
-
Performance results are in the timing table; search for derecho.
-
In a comparison of Cheyenne and Derecho runs, each using 1800 tasks, I see a significant difference in the time for prim_advec_tracers_fvm.
-
Cheyenne:
Derecho:
-
Can you point me to your code base? I would like to add more timers. It does not look like the increased FVM:reconstruction:extend_panel1 timings explain the roughly 10x overall slowdown in fvm_reconstruction.
-
Derecho:
Cheyenne:
-
In your table above (repeated here):
Cheyenne:
fvm:tracers_reconstruct 1800 1800 334428 41.2533 37.6388 1714 55.1831 198
Derecho:
fvm:tracers_reconstruct 1800 1800 334428 748.8997 364.5543 1795 787.3841 76
If I understand it correctly, fvm:tracers_reconstruct is ~750 on Derecho and ~41 on Cheyenne, which is ~18x. Also, the part 1-3 timings on Cheyenne more or less add up to fvm:tracers_reconstruct (which they should), but on Derecho they don't.
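For anyone who wants to check the ~18x figure, here is a minimal sketch in Python that recomputes it from the two timer rows quoted above. The column layout is not stated in this thread, so picking the fourth numeric field is an assumption; it is simply the value quoted as ~41 and ~750. The helper names are made up for illustration.

```python
# Timer rows copied verbatim from the comment above.
CHEYENNE_ROW = "fvm:tracers_reconstruct 1800 1800 334428 41.2533 37.6388 1714 55.1831 198"
DERECHO_ROW = "fvm:tracers_reconstruct 1800 1800 334428 748.8997 364.5543 1795 787.3841 76"

# Assumption: index 4 (the fourth numeric field after the timer name) is the
# time being compared; the meaning of the other columns is not given here.
TIME_FIELD = 4


def timer_value(row: str, field: int = TIME_FIELD) -> float:
    """Return one whitespace-separated field of a timer row as a float."""
    return float(row.split()[field])


def slowdown(slow_row: str, fast_row: str) -> float:
    """Ratio of the compared time in slow_row to that in fast_row."""
    return timer_value(slow_row) / timer_value(fast_row)


if __name__ == "__main__":
    ratio = slowdown(DERECHO_ROW, CHEYENNE_ROW)
    print(f"fvm:tracers_reconstruct, Derecho vs Cheyenne: {ratio:.1f}x")
```

The same helper could be pointed at the part 1-3 sub-timers to see whether they sum to the parent timer, but those rows are not reproduced in this comment.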
-
I've broken it down further; here the number following epi is the value of cubeboundary on the call.
-
I added a timer to the call and return and found all of the extra time:
-
Holy cow, I think I fixed it. I'd better call it a weekend.
Here is the change that I made in fvm_consistent_se_cslam.F90:
-
I am running PFS.ne30pg3_ne30pg3_mg17.FLTHIST_v0c.derecho_intel and have run into a number of cases with the code hanging in bndry_mod.F90. Looking at that code, I see a couple of cases where neighborhood collectives are used when MPI_VERSION >= 3. I copied that file to SourceMods and disabled that check so that the older collective is used. That seems to avoid the hang.