
Segmentation fault filling halo regions with Partition(y=2) #3878

Open
glwagner opened this issue Oct 29, 2024 · 14 comments
Labels
distributed 🕸️ Our plan for total cluster domination

Comments

@glwagner
Member

glwagner commented Oct 29, 2024

Not sure how this is possible, but the following code throws a segfault:

using Oceananigans
using Oceananigans.BoundaryConditions: fill_halo_regions!

partition = Partition(y=2)
arch = Distributed(GPU(); partition)
x = y = z = (0, 1)
grid = RectilinearGrid(arch; size=(16, 16, 16), x, y, z, topology=(Periodic, Periodic, Bounded))
c = CenterField(grid)
fill_halo_regions!(c)

I'm running with

$ mpiexecjl -n 2 julia --project test_interpolate.jl

(I originally found this error when trying to interpolate a field, but it seems to boil down to a halo-filling issue.)

This is the error I get:

[ Info: Oceananigans will use 32 threads
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: Oceananigans will use 32 threads
[ Info: MPI has not been initialized, so we are calling MPI.Init().

[116989] signal (11.2): Segmentation fault
in expression starting at /orcd/data/raffaele/001/glwagner/OceananigansPaper/listings/test_interpolate.jl:10
__memcpy_ssse3 at /lib64/libc.so.6 (unknown line)
MPIDI_CH3_iSendv at /orcd/data/raffaele/001/glwagner/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPIDI_CH3_EagerContigIsend at /orcd/data/raffaele/001/glwagner/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPID_Isend at /orcd/data/raffaele/001/glwagner/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPI_Isend at /orcd/data/raffaele/001/glwagner/.julia/artifacts/e85c0a68e07fee0ee7b19c2abc210b1af2f4771a/lib/libmpi.so (unknown line)
MPI_Isend at /orcd/data/raffaele/001/glwagner/.julia/packages/MPI/TKXAj/src/api/generated_api.jl:2151 [inlined]
Isend at /orcd/data/raffaele/001/glwagner/.julia/packages/MPI/TKXAj/src/pointtopoint.jl:66
Isend at /orcd/data/raffaele/001/glwagner/.julia/packages/MPI/TKXAj/src/pointtopoint.jl:70 [inlined]
Isend at /orcd/data/raffaele/001/glwagner/.julia/packages/MPI/TKXAj/src/pointtopoint.jl:70 [inlined]
send_south_halo at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:317
#fill_south_and_north_halo!#50 at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:263
fill_south_and_north_halo! at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:250
unknown function (ip: 0x2aaac8afa8b6)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
#fill_halo_event!#40 at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:208
fill_halo_event! at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:193
unknown function (ip: 0x2aaac8aefb2e)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
#fill_halo_regions!#38 at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:114
fill_halo_regions! at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:101 [inlined]
#fill_halo_regions!#37 at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:90 [inlined]
fill_halo_regions! at /orcd/data/raffaele/001/glwagner/Oceananigans.jl/src/DistributedComputations/halo_communication.jl:87
unknown function (ip: 0x2aaac8ad0ee5)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
jl_apply at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_call at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:126
eval_value at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:617
jl_interpret_toplevel_thunk at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/interpreter.c:775
jl_toplevel_eval_flex at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/toplevel.c:934
jl_toplevel_eval_flex at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/toplevel.c:877
ijl_toplevel_eval_in at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/toplevel.c:985
eval at ./boot.jl:385 [inlined]
include_string at ./loading.jl:2076
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
_include at ./loading.jl:2136
include at ./Base.jl:495
jfptr_include_46447.1 at /orcd/data/raffaele/001/glwagner/Software/julia-1.10.5/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
exec_options at ./client.jl:318
_start at ./client.jl:552
jfptr__start_82798.1 at /orcd/data/raffaele/001/glwagner/Software/julia-1.10.5/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/gf.c:3077
jl_apply at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
true_main at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/jlapi.c:582
jl_repl_entrypoint at /cache/build/builder-amdci4-4/julialang/julia-release-1-dot-10/src/jlapi.c:731
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 26236174 (Pool: 26209699; Big: 26475); GC: 35

I'll test on the CPU, then try to see whether this situation is covered by our tests.

glwagner added the distributed 🕸️ Our plan for total cluster domination label on Oct 29, 2024
@glwagner
Member Author

Why don't we test the distributed NonhydrostaticModel here?

if CPU() ∈ archs
    for partition in [Partition(1, 4), Partition(2, 2), Partition(4, 1)]
        @info "Time-stepping a distributed NonhydrostaticModel with partition $partition..."
        arch = Distributed(; partition)
        grid = RectilinearGrid(arch, topology=(Periodic, Periodic, Periodic), size=(8, 8, 8), extent=(1, 2, 3))
        model = NonhydrostaticModel(; grid)

or are there tests elsewhere?

@glwagner
Member Author

The test architectures are specified here:

test_child_arch() = CUDA.has_cuda() ? GPU() : CPU()

function test_architectures()
    child_arch = test_child_arch()

    # If MPI is initialized with MPI.Comm_size > 0, we are running in parallel.
    # We test several different configurations: `Partition(x = 4)`, `Partition(y = 4)`,
    # `Partition(x = 2, y = 2)`, and different fractional subdivisions in x, y and xy
    if MPI.Initialized() && MPI.Comm_size(MPI.COMM_WORLD) == 4
        return (Distributed(child_arch; partition = Partition(4)),
                Distributed(child_arch; partition = Partition(1, 4)),
                Distributed(child_arch; partition = Partition(2, 2)),
                Distributed(child_arch; partition = Partition(x = Fractional(1, 2, 3, 4))),
                Distributed(child_arch; partition = Partition(y = Fractional(1, 2, 3, 4))),
                Distributed(child_arch; partition = Partition(x = Fractional(1, 2), y = Equal())))
    else
        return tuple(child_arch)
    end
end

This was hard to find at first
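For illustration, a 2-rank branch along these lines would cover the Partition(y=2) case from the MWE at the top (a hypothetical sketch only, not something that exists in the test suite today):

using MPI
using Oceananigans: Distributed, Partition

# Hypothetical 2-rank analogue of test_architectures(), so that the
# Partition(y=2) configuration from the MWE is exercised when the tests
# run with two ranks.
function two_rank_test_architectures(child_arch)
    if MPI.Initialized() && MPI.Comm_size(MPI.COMM_WORLD) == 2
        return (Distributed(child_arch; partition = Partition(y = 2)),
                Distributed(child_arch; partition = Partition(x = 2)))
    else
        return tuple(child_arch)
    end
end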

@glwagner
Member Author

Are the distributed GPU tests actually running?

I see this:

https://buildkite.com/clima/oceananigans-distributed/builds/4081#0192d4e4-191f-48e1-a943-d82377d8a125/189-1099

And then subsequently it looks like the architecture is Distributed{CPU}.

Do we need a better way to specify the test architectures?

@glwagner
Member Author

@simone-silvestri

@simone-silvestri
Collaborator

Damn, it looks like the tests on the GPU are not working because CUDA is not loaded properly. I am trying to address this in #3880.

A segmentation fault probably means the MPI is not CUDA-aware. Typically, the MPI that ships with MPI_jll is not CUDA-aware. A good way to check is:

julia> using MPI

julia> MPI.has_cuda()
true

@glwagner
Member Author

Thanks @simone-silvestri, it turns out that I wasn't using CUDA-aware MPI.

#3883 addresses this by adding an error if CUDA-aware MPI is not available, so that we are not confronted with a mysterious segmentation fault (which could be caused by any number of issues, not just CUDA-aware MPI).
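For context, here is a minimal sketch of the kind of guard this could be, assuming a check at architecture construction time; the function name validate_cuda_aware_mpi is hypothetical and #3883 may implement this differently:

using MPI
using Oceananigans: GPU

# Hypothetical guard: fail early with an informative error instead of
# segfaulting deep inside MPI_Isend when the MPI library cannot handle
# device (GPU) buffers.
function validate_cuda_aware_mpi(child_architecture)
    if child_architecture isa GPU && !MPI.has_cuda()
        error("Distributed(GPU(); ...) requires a CUDA-aware MPI build, ",
              "but MPI.has_cuda() is false. Configure MPI.jl to use a ",
              "CUDA-aware system MPI (see the MPI.jl configuration docs).")
    end
    return nothing
end

(Note that MPI.has_cuda() can only reliably report CUDA support for Open MPI, so a real check may need to be smarter.)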

Since we don't have GPU tests right now I will also check to make sure that this runs with a proper CUDA-aware MPI.

@liuchihl
Contributor

liuchihl commented Jan 8, 2025

I have reproduced the segmentation fault using the same MWE as above, but with srun -n 2 julia --project test_multiGPU.jl.

I was actually expecting openmpi/4.1.5+cuda (a module that can be loaded) to be CUDA-aware, but it is not, as shown below:

julia> using MPI

julia> MPI.has_cuda()
false

Is there something I am missing here?

@glwagner
Member Author

glwagner commented Jan 8, 2025

Not sure, what cluster are you using?

@liuchihl
Contributor

liuchihl commented Jan 8, 2025

I am using Delta

@liuchihl
Contributor

liuchihl commented Jan 8, 2025

According to this Open MPI doc, the checks below seem to show that the MPI I am using was built with CUDA support:

$ ompi_info | grep "MPI extensions"
[dt-login02.delta.ncsa.illinois.edu:95561] mca: base: components_register: registering framework ras components
[dt-login02.delta.ncsa.illinois.edu:95561] mca: base: components_register: found loaded component simulator
[dt-login02.delta.ncsa.illinois.edu:95561] mca: base: components_register: component simulator register function successful
[dt-login02.delta.ncsa.illinois.edu:95561] mca: base: components_register: found loaded component slurm
[dt-login02.delta.ncsa.illinois.edu:95561] mca: base: components_register: component slurm register function successful
          MPI extensions: affinity, cuda, pcollreq
[dt-login02.delta.ncsa.illinois.edu:95561] mca: base: close: unloading component simulator
[dt-login02.delta.ncsa.illinois.edu:95561] mca: base: close: unloading component slurm

And

$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
[dt-login02.delta.ncsa.illinois.edu:208020] mca: base: components_register: registering framework ras components
[dt-login02.delta.ncsa.illinois.edu:208020] mca: base: components_register: found loaded component simulator
[dt-login02.delta.ncsa.illinois.edu:208020] mca: base: components_register: component simulator register function successful
[dt-login02.delta.ncsa.illinois.edu:208020] mca: base: components_register: found loaded component slurm
[dt-login02.delta.ncsa.illinois.edu:208020] mca: base: components_register: component slurm register function successful
mca:mpi:base:param:mpi_built_with_cuda_support:value:true
[dt-login02.delta.ncsa.illinois.edu:208020] mca: base: close: unloading component simulator
[dt-login02.delta.ncsa.illinois.edu:208020] mca: base: close: unloading component slurm

Aside from requiring CUDA-aware MPI, could there be other factors causing the segmentation fault?

@glwagner Did you manage to solve the segfault when running the MWE on your cluster? I am curious how that went.
Thanks.

@glwagner
Member Author

glwagner commented Jan 9, 2025

Are you sure that this command:

$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

tests the same MPI implementation that you are using to launch julia for your test?

A related question is: what steps have you taken to ensure that the cluster openmpi (which is loaded as a module) is used to build MPI.jl? This can often be a little tricky. Here is the documentation: https://juliaparallel.org/MPI.jl/stable/configuration/
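For reference, the basic pattern from those docs is: load the cluster MPI module (the openmpi/4.1.5+cuda one you mentioned), add MPIPreferences to your project, and then, in that project, run something like:

julia> using MPIPreferences

julia> MPIPreferences.use_system_binary()    # writes LocalPreferences.toml so MPI.jl binds to the loaded system MPI instead of MPI_jll

After restarting Julia, MPI.has_cuda() should then return true if that Open MPI build really was compiled with CUDA support.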

I wrote up my experience with NCAR's Derecho because I was amazed at how intricate and fragile the process of getting CUDA-aware MPI to work was: #3669

@glwagner
Member Author

glwagner commented Jan 9, 2025

I'll test it myself, but note that this is also tested in CI, for example: https://buildkite.com/clima/oceananigans-distributed/builds/5371#01944227-aca9-4485-a7ba-cac6571bf9ff/247-1301

So probably I should close this issue...

@glwagner
Member Author

glwagner commented Jan 9, 2025

Okay, I followed the instructions in #3669, but applied them to the MWE script from the top post. The job is currently in the queue, so I will report whether or not there are errors.

@liuchihl
Contributor

liuchihl commented Jan 9, 2025

Thanks for all this helpful information; I will spend some time understanding CUDA-aware MPI better!
