
Fix HIP multi-GPU bug #822

Merged: 3 commits into brucefan1983:master on Dec 12, 2024

Conversation

BohanZhang0908 (Contributor)

Summary
Fixes a problem with multi-GPU parallelism on AMD GPUs (HIP).
Modification
5 additions to the code.
Others

brucefan1983 changed the title from "Change" to "Fix HIP multi-GPU bug" on Dec 12, 2024
brucefan1983 (Owner)

@Dankomaister
@jesperbygg

HIP multi-GPU bug fixed!

brucefan1983 (Owner)

@elindgren

jesperbygg

Excellent!

elindgren (Collaborator)

Nice! Was the problem that there were still ongoing calculations on the GPU from previous compute calls? Or why else was gpuDeviceSynchronize needed?

brucefan1983 (Owner)

Perhaps the CUDA version is just lucky. Rigorously speaking, a sync is needed.
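
A minimal HIP sketch (not GPUMD's actual code; the kernel names, buffers, and sizes are invented for illustration) of the kind of race an explicit device synchronization prevents: kernels launched on different GPUs are not ordered with respect to each other, so a kernel on GPU 1 that reads data produced on GPU 0 can start while GPU 0 is still writing it. In GPUMD's HIP build, gpuDeviceSynchronize presumably maps to hipDeviceSynchronize.

// Race sketch: GPU 1 consumes data that GPU 0 may still be producing.
// Error checking omitted for brevity; requires peer access between the GPUs.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void produce(double* x, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = 1.0;  // stand-in for a per-GPU force/descriptor kernel
}

__global__ void consume(const double* x, double* y, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = x[i];  // stand-in for reading halo data owned by GPU 0
}

int main()
{
  const int n = 1 << 20;
  double *a0, *a1;

  hipSetDevice(0);
  hipMalloc(&a0, n * sizeof(double));
  hipLaunchKernelGGL(produce, dim3((n + 127) / 128), dim3(128), 0, 0, a0, n);

  // The essential fix: wait for GPU 0 to finish before GPU 1 touches its data.
  // A CUDA build may happen to work without this, but it is required in general.
  hipDeviceSynchronize();

  hipSetDevice(1);
  hipDeviceEnablePeerAccess(0, 0);  // let GPU 1 read GPU 0's memory directly
  hipMalloc(&a1, n * sizeof(double));
  hipLaunchKernelGGL(consume, dim3((n + 127) / 128), dim3(128), 0, 0, a0, a1, n);
  hipDeviceSynchronize();

  printf("done\n");
  hipFree(a1);
  hipSetDevice(0);
  hipFree(a0);
  return 0;
}

On the CUDA side, implicit ordering between the default stream and peer transfers can hide such a race, which would be consistent with the "just lucky" remark above; HIP simply exposed it.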

brucefan1983 merged commit 8d885f4 into brucefan1983:master on Dec 12, 2024
2 checks passed
Dankomaister

This is great! I will do some benchmarking :)

Btw @brucefan1983, would it be possible to lower the system-size requirement for running on multiple GPUs? Or is there some technical limitation for this (apart from efficiency)?

Currently we require that the shortest lattice constant is more than 5 times the cutoff per GPU: "The longest direction has less than 5 times of the NEP cutoff per GPU."

Since we have 8 MI250x GPUs on one node (4 physical cards), that means we need 5 × 8 × cutoff, which is a huge system, especially if one has a cubic box.

brucefan1983 (Owner)

Yes, this requirement is based on efficiency. Perhaps it can be reduced to a minimum of 3 cutoffs. You can try changing the code to

  if (num_bins_longitudinal < 6) {
    printf("The longest direction has less than 3 times of the NEP cutoff per GPU.\n");
    printf("Please reduce the number of GPUs or increase the simulation cell size.\n");
    exit(1);
  }

But I encourage you to compare the performance with an increasing number of GPUs and then choose wisely.
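
As a rough illustration of the sizes involved (assuming a NEP cutoff of about 6 Å, a value not stated in this thread): with the current 5-cutoffs-per-GPU requirement and 8 GPUs, the decomposition direction must exceed roughly 5 × 8 × 6 Å = 240 Å, whereas relaxing the limit to 3 cutoffs would lower this to about 3 × 8 × 6 Å = 144 Å.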
