Segmentation fault filling halo regions with Partition(y=2) (#3878)
Comments
Why don't we test the distributed NonhydrostaticModel here (Oceananigans.jl/test/test_distributed_models.jl, lines 451 to 456 at 9ffbee3), or are there tests elsewhere?
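For context, a minimal sketch of the kind of distributed NonhydrostaticModel test being asked about could look like the following. This is illustrative only, not the code at the referenced lines; the grid size, partition, and time step are arbitrary.

```julia
# Illustrative sketch; not the actual test in test_distributed_models.jl.
# Run under MPI with two ranks, e.g. `mpiexec -n 2 julia --project this_file.jl`.
using MPI
using Oceananigans
using Oceananigans.DistributedComputations: Distributed, Partition

MPI.Init()

arch  = Distributed(CPU(); partition=Partition(y=2))        # one rank per y-slab
grid  = RectilinearGrid(arch; size=(8, 8, 8), extent=(1, 1, 1))
model = NonhydrostaticModel(; grid)

time_step!(model, 1e-2)   # should exchange halos across ranks without error
```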
The test architectures are specified here: Oceananigans.jl/test/utils_for_runtests.jl, lines 6 to 24 at 9ffbee3. This was hard to find at first.
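For readers who also find this hard to locate, the selection roughly amounts to choosing a child architecture and wrapping it in a Distributed architecture when MPI is in play. The sketch below is illustrative only; the environment-variable name and the exact branching are assumptions, not what utils_for_runtests.jl actually does.

```julia
# Illustrative sketch of test-architecture selection; the real logic in
# test/utils_for_runtests.jl (lines 6-24 at 9ffbee3) may differ.
using MPI
using Oceananigans
using Oceananigans.DistributedComputations: Distributed

function test_architectures()
    # Hypothetical switch; the actual test suite may key off a different variable.
    child_arch = get(ENV, "TEST_ARCHITECTURE", "CPU") == "GPU" ? GPU() : CPU()

    if MPI.Initialized() && MPI.Comm_size(MPI.COMM_WORLD) > 1
        return (Distributed(child_arch),)   # distributed tests: wrap the child architecture
    else
        return (child_arch,)                # serial tests: use the child architecture directly
    end
end
```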
Are the distributed GPU tests actually running? Looking at the CI output, it subsequently seems that the architecture is not what I expect. Do we need a better way to specify the test architectures?
Damn, it looks like the tests on the GPU are not working because CUDA is not loaded properly.

```julia
julia> using MPI

julia> MPI.has_cuda()
true
```
Thanks @simone-silvestri, it turns out that I wasn't using CUDA-aware MPI. #3883 addresses this by adding an error when CUDA-aware MPI is not available, so that we are not confronted with a mysterious segmentation fault (which could be caused by any number of issues, not just the lack of CUDA-aware MPI). Since we don't have GPU tests right now, I will also check to make sure that this runs with a proper CUDA-aware MPI.
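The check could be as simple as the sketch below. This is illustrative only; the actual implementation in #3883 may be structured differently, and the function name is hypothetical.

```julia
# Illustrative sketch of the kind of guard added in #3883 (not the actual code):
# fail loudly at architecture-construction time if a GPU is requested but the
# underlying MPI library is not CUDA-aware, instead of segfaulting during a
# halo exchange much later.
using MPI
using Oceananigans: GPU

function validate_cuda_aware_mpi(child_architecture)
    if child_architecture isa GPU && !MPI.has_cuda()
        error("Distributed GPU runs require CUDA-aware MPI, but MPI.has_cuda() is false. " *
              "Rebuild or reconfigure MPI.jl against a CUDA-aware MPI installation.")
    end
    return nothing
end
```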
I have reproduced the segmentation fault using the same MWE above. I was actually expecting openmpi/4.1.5+cuda (a module that can be loaded) to be CUDA-aware, but it is not, as shown below:

```julia
julia> using MPI

julia> MPI.has_cuda()
false
```

Is there something I am missing here?
Not sure, what cluster are you using?
I am using Delta.
In this OpenMPI doc, the tests seem to show that the MPI I am using is built with CUDA support; a sketch of the usual checks is below.
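The checks the OpenMPI documentation typically suggests are a compile-time query of the MPI installation and a runtime query from the application side. The exact commands in the linked doc may differ from this sketch.

```julia
# Sketch of the usual two-level check (the exact commands in the OpenMPI doc may differ).
# Compile-time CUDA support of the loaded OpenMPI module, queried from the shell:
#
#     ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
#
# Runtime support as seen by the MPI library that MPI.jl is actually linked against:
using MPI
@show MPI.has_cuda()
```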
Aside from requiring CUDA-aware MPI, could there be other factors causing the segmentation fault? @glwagner, did you manage to solve the segfault when running the MWE on your cluster? I am curious how that went.
Are you sure that this command tests the same MPI implementation that you are using to launch Julia for your test? A related question: what steps have you taken to ensure that the cluster OpenMPI (which is loaded as a module) is the MPI that MPI.jl is actually built against? I wrote up my experience with NCAR's Derecho because I was amazed at how intricate and fragile the process of getting CUDA-aware MPI to work was: #3669
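For reference, binding MPI.jl to the module-provided OpenMPI (rather than the default JLL-provided binary) typically goes through MPIPreferences. A minimal sketch, assuming the cluster's CUDA-aware MPI module is already loaded; cluster-specific details are in #3669.

```julia
# Minimal sketch: point MPI.jl at the system (module-provided) MPI.
# Run once in the project environment, after loading the cluster's MPI module
# (e.g. `module load openmpi/4.1.5+cuda` or equivalent), then restart Julia.
using MPIPreferences
MPIPreferences.use_system_binary()   # records the choice in LocalPreferences.toml
```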
I'll test it myself, but note that this is also tested in CI, for example: https://buildkite.com/clima/oceananigans-distributed/builds/5371#01944227-aca9-4485-a7ba-cac6571bf9ff/247-1301
So probably I should close this issue...
Okay, I followed the instructions here (#3669), except applying them to the Delta cluster.
Thanks for all this helpful information; I will spend some time understanding CUDA-aware MPI better!
Not sure how this is possible, but the following code throws a segfault:
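The original code block was not preserved in this copy of the issue. A hypothetical reconstruction of the kind of MWE described (a distributed GPU grid partitioned in y across two ranks, followed by a halo fill) might look like:

```julia
# Hypothetical reconstruction, not the original MWE: fill the halo regions of a
# field on a distributed GPU grid with Partition(y=2), run on two MPI ranks.
using MPI
using Oceananigans
using Oceananigans.DistributedComputations: Distributed, Partition
using Oceananigans.BoundaryConditions: fill_halo_regions!

MPI.Init()

arch = Distributed(GPU(); partition=Partition(y=2))
grid = RectilinearGrid(arch; size=(16, 16, 16), extent=(1, 1, 1))

c = CenterField(grid)
set!(c, (x, y, z) -> x + y + z)

fill_halo_regions!(c)   # the reported segfault occurs during this halo exchange
```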
I'm running with
(I found this error originally when trying to interpolate a field, but it seems it boils down to a halo filling issue)
This is the error I get:
I'll test on the CPU, then try to see whether this situation is tested.