
How do I use svdvals in parallel? #75

Open
dh-ilight opened this issue Nov 17, 2021 · 19 comments

@dh-ilight

dh-ilight commented Nov 17, 2021

My goal is to add svd(A::DArray) to the Julia interface to Elemental. I have been looking at svdvals
to understand how to do it, but I do not know how to use svdvals in parallel. The following program
has a sum of diffs close to 0 when using a single process. When I run it with julia -p 2 or mpiexecjl -n 2 the sum of diffs is large.
How do I convert the program to run in parallel?

using Elemental
using DistributedArrays
using LinearAlgebra

A  = drandn(50,50)
Al = Matrix(A)

a = svdvals(Al)
b = Elemental.svdvals(A)
println("sum of diffs= ",sum(a-b))
@JBlaschke

Hi @dhiepler -- I also saw your trouble ticket at NERSC, let me respond here so that the community can weigh in.

Without having tried it myself (I am currently traveling and have bad internet -- and alas my burner laptop doesn't have Julia), here are some ideas:

  1. Are a and b ordered in the same way? (Normally it's largest-to-smallest, but this should be checked.)
  2. Is Matrix the right object to use? Or do we want DistMatrix?
  3. Can you check that a and b are actually singular values (e.g. compare them against the square roots of the eigenvalues of A'A)? A quick local check is sketched below.
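
For points 1 and 3, a minimal local sanity check could look like the sketch below (plain LinearAlgebra on an ordinary Matrix, not a DArray; the abs is only there to guard against tiny negative rounding errors in the eigenvalues):

using LinearAlgebra

Al = randn(10, 10)              # plain local matrix, just for the check
a  = svdvals(Al)

# 1. ordering: svdvals should return the values largest-to-smallest
@assert issorted(a, rev = true)

# 3. are they really singular values? Compare against the square roots
#    of the eigenvalues of Al'Al, sorted the same way.
ev = sort(sqrt.(abs.(eigvals(Symmetric(Al' * Al)))), rev = true)
@assert a ≈ ev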

I will check these myself soon, but maybe @andreasnoack has some ideas in the meantime.

Cheers,
Johannes

@JBlaschke

I have more information, thanks also to @dhiepler. Here is an expanded test program:

using Distributed
using Elemental
using DistributedArrays
using LinearAlgebra
@everywhere using Random
@everywhere Random.seed!(123)

A = drandn(10,10)
Al = Matrix(A)
println("sum of diffs= ",sum(A-Al))

println("Determinant of singular matrix is 0")
#println("det(A)= ", det(A)) # not implemented for DArray
println("det(Al)= ", det(Al))

a = svdvals(Al)
b = Elemental.svdvals(A)
println("a= ", a)
println("b= ", b)
println("a-b= ", a-b)
println("a./b= ", a./b)


#println("det(a) = ", det(a))
println("sum of ratios after svd = ",sum(a./b))
println("sum of diffs after svd = ",sum(a-b))
println("approx= ", a  b) 

What I found was rather startling: it works fine on my laptop with more than one MPI rank. When I run it using MPI on Cori, I get a constant factor difference in each element:

One rank:

blaschke@nid12849:~/tmp> srun -n 1 julia test_sdvals_2.jl 
sum of diffs= 0.0
Determinant of singular matrix is 0
det(Al)= -1191.023311448886
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
b= [6.351702336978771; 5.880409677742518; 4.799195878542354; 3.3478338822775187; 2.980624165171743; 2.0566695219525486; 1.4953776892773092; 1.1050612757140172; 0.6085328305411005; 0.3219564747545404]
a-b= [4.440892098500626e-15; -2.6645352591003757e-15; 1.7763568394002505e-15; 1.3322676295501878e-15; -1.3322676295501878e-15; 0.0; 2.220446049250313e-16; -8.881784197001252e-16; -3.3306690738754696e-16; 2.220446049250313e-16]
a./b= [1.0000000000000007; 0.9999999999999996; 1.0000000000000004; 1.0000000000000004; 0.9999999999999996; 1.0; 1.0000000000000002; 0.9999999999999992; 0.9999999999999994; 1.0000000000000007]
sum of ratios after svd = 9.999999999999998
sum of diffs after svd = 2.7755575615628914e-15
approx= true

Two ranks

blaschke@nid12849:~/tmp> srun -n 2 julia test_sdvals_2.jl 
sum of diffs= 0.0
Determinant of singular matrix is 0
sum of diffs= 0.0
Determinant of singular matrix is 0
det(Al)= -1191.023311448886
det(Al)= -1191.023311448886
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
b= [12.703404673957543; 11.760819355485031; 9.598391757084704; 6.695667764555037; 5.961248330343489; 4.113339043905101; 2.990755378554618; 2.210122551428033; 1.2170656610821995; 0.6439129495090811]
b= [12.703404673957543; 11.760819355485031; 9.598391757084704; 6.695667764555037; 5.961248330343489; 4.113339043905101; 2.990755378554618; 2.210122551428033; 1.2170656610821995; 0.6439129495090811]
a-b= [-6.351702336978768; -5.880409677742516; -4.799195878542348; -3.3478338822775173; -2.980624165171747; -2.056669521952552; -1.4953776892773085; -1.1050612757140168; -0.6085328305410993; -0.3219564747545404]
a-b= [-6.351702336978768; -5.880409677742516; -4.799195878542348; -3.3478338822775173; -2.980624165171747; -2.056669521952552; -1.4953776892773085; -1.1050612757140168; -0.6085328305410993; -0.3219564747545404]
a./b= [0.5000000000000003; 0.5; 0.5000000000000003; 0.5000000000000002; 0.49999999999999956; 0.49999999999999956; 0.5000000000000001; 0.4999999999999999; 0.5000000000000003; 0.5000000000000002]
sum of ratios after svd = 5.0
sum of diffs after svd = -28.947363732952414
a./b= [0.5000000000000003; 0.5; 0.5000000000000003; 0.5000000000000002; 0.49999999999999956; 0.49999999999999956; 0.5000000000000001; 0.4999999999999999; 0.5000000000000003; 0.5000000000000002]
sum of ratios after svd = 5.0
sum of diffs after svd = -28.947363732952414
approx= false
approx= false

Four ranks

blaschke@nid12849:~/tmp> srun -n 4 julia test_sdvals_2.jl 
sum of diffs= 0.0
Determinant of singular matrix is 0
sum of diffs= 0.0
sum of diffs= 0.0
Determinant of singular matrix is 0
Determinant of singular matrix is 0
sum of diffs= 0.0
Determinant of singular matrix is 0
det(Al)= -1191.023311448886
det(Al)= -1191.023311448886
det(Al)= -1191.023311448886
det(Al)= -1191.023311448886
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
a= [6.351702336978775, 5.880409677742516, 4.799195878542355, 3.34783388227752, 2.980624165171742, 2.0566695219525486, 1.4953776892773094, 1.1050612757140164, 0.6085328305411002, 0.32195647475454064]
b= [25.406809347915086; 23.521638710970063; 19.196783514169415; 13.391335529110073; 11.92249666068697; 8.2266780878102; 5.981510757109234; 4.420245102856065; 2.4341313221643994; 1.2878258990181617]
b= [25.406809347915086; 23.521638710970063; 19.196783514169415; 13.391335529110073; 11.92249666068697; 8.2266780878102; 5.981510757109234; 4.420245102856065; 2.4341313221643994; 1.2878258990181617]
b= [25.406809347915086; 23.521638710970063; 19.196783514169415; 13.391335529110073; 11.92249666068697; 8.2266780878102; 5.981510757109234; 4.420245102856065; 2.4341313221643994; 1.2878258990181617]
b= [25.406809347915086; 23.521638710970063; 19.196783514169415; 13.391335529110073; 11.92249666068697; 8.2266780878102; 5.981510757109234; 4.420245102856065; 2.4341313221643994; 1.2878258990181617]
a-b= [-19.05510701093631; -17.641229033227546; -14.397587635627058; -10.043501646832553; -8.941872495515229; -6.170008565857652; -4.486133067831925; -3.3151838271420493; -1.8255984916232992; -0.965869424263621]
a-b= [-19.05510701093631; -17.641229033227546; -14.397587635627058; -10.043501646832553; -8.941872495515229; -6.170008565857652; -4.486133067831925; -3.3151838271420493; -1.8255984916232992; -0.965869424263621]
a-b= [-19.05510701093631; -17.641229033227546; -14.397587635627058; -10.043501646832553; -8.941872495515229; -6.170008565857652; -4.486133067831925; -3.3151838271420493; -1.8255984916232992; -0.965869424263621]
a-b= [-19.05510701093631; -17.641229033227546; -14.397587635627058; -10.043501646832553; -8.941872495515229; -6.170008565857652; -4.486133067831925; -3.3151838271420493; -1.8255984916232992; -0.965869424263621]
a./b= [0.25000000000000017; 0.25; 0.2500000000000001; 0.2500000000000001; 0.24999999999999992; 0.24999999999999983; 0.25000000000000017; 0.25; 0.2500000000000001; 0.25000000000000017]
sum of ratios after svd = 2.5000000000000004
sum of diffs after svd = -86.84209119885725
a./b= [0.25000000000000017; 0.25; 0.2500000000000001; 0.2500000000000001; 0.24999999999999992; 0.24999999999999983; 0.25000000000000017; 0.25; 0.2500000000000001; 0.25000000000000017]
sum of ratios after svd = 2.5000000000000004
sum of diffs after svd = -86.84209119885725
a./b= [0.25000000000000017; 0.25; 0.2500000000000001; 0.2500000000000001; 0.24999999999999992; 0.24999999999999983; 0.25000000000000017; 0.25; 0.2500000000000001; 0.25000000000000017]
sum of ratios after svd = 2.5000000000000004
sum of diffs after svd = -86.84209119885725
a./b= [0.25000000000000017; 0.25; 0.2500000000000001; 0.2500000000000001; 0.24999999999999992; 0.24999999999999983; 0.25000000000000017; 0.25; 0.2500000000000001; 0.25000000000000017]
sum of ratios after svd = 2.5000000000000004
sum of diffs after svd = -86.84209119885725
approx= false
approx= false
approx= false
approx= false

My crude guess would be that there is a factor difference of 1/n (n = number of ranks). I built the version on Cori against Cray MPICH. Any ideas, @andreasnoack?
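
If that guess is right, the mismatch is just the scale-invariance of singular values showing up: svdvals(c*A) == c*svdvals(A) for c > 0, so b looking like n*a would be consistent with the distributed matrix effectively being assembled n times over. A quick local illustration of the scaling property (plain Julia, no MPI or Elemental involved):

using LinearAlgebra

Al = randn(10, 10)
n  = 2                          # stand-in for the number of ranks
a  = svdvals(Al)
b  = svdvals(n .* Al)           # what an n-times-assembled matrix would give
println("a ./ b = ", a ./ b)    # ≈ fill(1/n, 10)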

@andreasnoack
Member

andreasnoack commented Dec 1, 2021

I can reproduce the bug on my laptop. My guess is that this is an issue with the conversion between Elemental arrays and DistributedArrays.DArray. If I just use Elemental's DistMatrix then I don't get wrong results, i.e.

using Elemental
using LinearAlgebra

A = Elemental.DistMatrix(Float64)
Elemental.gaussian!(A, 10, 10)
Al = Array(A)
println("sum of diffs= ", sum(A - Al))

a = svdvals(Al)
b = svdvals(A)
println("a = ", a)
println("b = ", b)
println("a - b= ", a - b)
println("a ./ b= ", a ./ b)

@info "DONE!"

@JBlaschke

Thanks @andreasnoack -- I just verified that this also works on Cori. This leaves an open question, given what @dhiepler is trying to do with the original code: how do we create an Elemental array from an existing array (if not from a DArray)?

@vchuravy
Member

vchuravy commented Dec 4, 2021

using MPIClusterManagers, Distributed

man = MPIManager(np = 4)
addprocs(man) # backend communicates with MPI

@everywhere begin
    using Elemental
    using LinearAlgebra
end

using DistributedArrays
A = drandn(50,50)
Al = Matrix(A)

a = svdvals(Al)
b = Array(Elemental.svdvals(A))

println("sum of diffs= ",sum(a-b))

@mpi_do man begin
    grid = Elemental.DefaultGrid()
    @show (Elemental.row(grid), Elemental.column(grid))
    A = Elemental.DistMatrix(Float64)
    Elemental.zeros!(A, 50,50)
    @show Elemental.localHeight(A), Elemental.height(A)
    @show Elemental.localWidth(A), Elemental.width(A)
end

On #77 this resulted for me in:

      From worker 3:    (Elemental.row(grid), Elemental.column(grid)) = (0, 0)
      From worker 2:    (Elemental.row(grid), Elemental.column(grid)) = (0, 0)
      From worker 4:    (Elemental.row(grid), Elemental.column(grid)) = (0, 0)
      From worker 5:    (Elemental.row(grid), Elemental.column(grid)) = (0, 0)
      From worker 3:    (Elemental.localHeight(A), Elemental.height(A)) = (50, 50)
      From worker 2:    (Elemental.localHeight(A), Elemental.height(A)) = (50, 50)
      From worker 4:    (Elemental.localHeight(A), Elemental.height(A)) = (50, 50)
      From worker 3:    (Elemental.localWidth(A), Elemental.width(A)) = (50, 50)
      From worker 2:    (Elemental.localWidth(A), Elemental.width(A)) = (50, 50)
      From worker 5:    (Elemental.localHeight(A), Elemental.height(A)) = (50, 50)
      From worker 4:    (Elemental.localWidth(A), Elemental.width(A)) = (50, 50)
      From worker 5:    (Elemental.localWidth(A), Elemental.width(A)) = (50, 50)

Meaning that the Grid configuration did not work. I double-checked my MPI.jl: I was using OpenMPI_jll as my MPI provider (I would expect using the system MPI to yield the same effect). The default Elemental_jll we are shipping uses MPICH_jll.

You can check yourself with:

julia> MPI.identify_implementation()
(MPI.MPICH, v"3.4.2")

After switching to MPICH_jll:

JULIA_MPI_BINARY="MPICH_jll" julia --project=. -e 'import Pkg; Pkg.build("MPI")'

I got:

vchuravy@odin ~/s/s/a/svd2> JULIA_PROJECT=(pwd) julia --project=. test.jl
      From worker 3:    (Elemental.row(grid), Elemental.column(grid)) = (1, 0)
      From worker 2:    (Elemental.row(grid), Elemental.column(grid)) = (0, 0)
      From worker 4:    (Elemental.row(grid), Elemental.column(grid)) = (0, 1)
      From worker 5:    (Elemental.row(grid), Elemental.column(grid)) = (1, 1)
      From worker 3:    (Elemental.localHeight(A), Elemental.height(A)) = (25, 50)
      From worker 2:    (Elemental.localHeight(A), Elemental.height(A)) = (25, 50)
      From worker 4:    (Elemental.localHeight(A), Elemental.height(A)) = (25, 50)
      From worker 3:    (Elemental.localWidth(A), Elemental.width(A)) = (25, 50)
      From worker 2:    (Elemental.localWidth(A), Elemental.width(A)) = (25, 50)
      From worker 5:    (Elemental.localHeight(A), Elemental.height(A)) = (25, 50)
      From worker 4:    (Elemental.localWidth(A), Elemental.width(A)) = (25, 50)
      From worker 5:    (Elemental.localWidth(A), Elemental.width(A)) = (25, 50)

When JuliaParallel/MPI.jl#513 lands we will be able to fix issues like this.

Now:

using MPIClusterManagers, Distributed

man = MPIManager(np = 4)
addprocs(man) # backend communicates with MPI

using Elemental
using LinearAlgebra
using DistributedArrays

A = drandn(50,50)
Al = Matrix(A)

a = svdvals(Al)
b = Array(Elemental.svdvals(A))

println("sum of diffs= ",sum(a-b))

yields sum of diffs= 1.2073675392798577e-14

@JBlaschke

Thanks @vchuravy for looking into this also. Unfortunately it doesn't work on Cori. Is the MPIClusterManagers part necessary? (I find that it seems to hang/time out on Cori -- maybe we can look into why in a separate issue/part of the Julia for HPC group).

Taking out the MPIClusterManagers part, I get:

blaschke@nid00805:~/tmp> JULIA_MPI_BINARY="MPICH_jll" srun -n 4 julia test_sdvals_4.jl 
sum of diffs= -297.4855631120604
sum of diffs= -296.6150430552368
sum of diffs= -303.4465072750694
sum of diffs= -293.6032813641691

using the system MPICH (i.e. cray-mpich).

@vchuravy
Member

vchuravy commented Dec 7, 2021

Yeah, the MPIClusterManager is necessary for the variant of the code that uses DistributedArrays; otherwise the ranks are not wired up correctly. A quick way to check the wiring is sketched below.
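
For what it's worth, a quick way to see whether the workers really end up wired together as MPI ranks is something like this sketch (assuming MPI.jl is available on the workers; every worker should report a distinct rank and a communicator size of 4, not (0, 1)):

using MPIClusterManagers, Distributed

man = MPIManager(np = 4)
addprocs(man)

@mpi_do man begin
    using MPI
    MPI.Initialized() || MPI.Init()
    # with correctly wired ranks: ranks 0..3 and size 4 on every worker
    @show MPI.Comm_rank(MPI.COMM_WORLD), MPI.Comm_size(MPI.COMM_WORLD)
end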

@JBlaschke

Oh Feck! So now I have to fix MPIClusterManager. @vchuravy, should I open an issue on https://github.com/JuliaParallel/MPIClusterManagers.jl?

@vchuravy
Member

vchuravy commented Dec 7, 2021

should I open an issue

Please do :)

@JBlaschke

Done!

@JBlaschke

JBlaschke commented Dec 8, 2021

@vchuravy Based on our discussion in JuliaParallel/MPIClusterManagers.jl#26, your solution was to run the addprocs version without srun. That won't work here though, because without srun, running:

using Elemental

throws

Wed Dec  8 12:00:48 2021: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1                                                                                                             
Wed Dec  8 12:00:48 2021: [unset]:_pmi_init:_pmi_alps_init returned -1                                                                                                                                      
[Wed Dec  8 12:00:59 2021] [c3-0c2s12n1] Fatal error in MPI_Init: Other MPI error, error stack:                                                                                                             
MPIR_Init_thread(537):                                                                                                                                                                                      
MPID_Init(246).......: channel initialization failed                                                                                                                                                        
MPID_Init(647).......:  PMI2 init failed: 1

when trying to initialize Elemental.jl ... we appear to have a catch-22 :(

@JBlaschke

And if I avoid addprocs and run the following within srun:

using MPIClusterManagers, Distributed
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)

using Elemental
using LinearAlgebra
using DistributedArrays

A = drandn(50,50)
Al = Matrix(A)

a = svdvals(Al)
b = Array(Elemental.svdvals(A))

println("sum of diffs= ",sum(a-b))
MPIClusterManagers.stop_main_loop(manager)

I get a deadlock.

@vchuravy
Member

vchuravy commented Dec 9, 2021

Without srun, but with salloc?

@JBlaschke

Elemental needs srun (without srun, Elemental won't be able to initialize Cray MPICH).

@vchuravy
Member

vchuravy commented Dec 9, 2021

I am confused. MPIClusterManager should use srun/mpiexec to connect the "worker" processes, which allows them to use Elemental and to communicate among themselves using MPI. The front-end should not need to run under mpiexec.

@JBlaschke

I don't know all the details at this point, but if I run Elemental (remember, we don't use the one provided by BinaryBuilder -- that one doesn't use Cray MPICH, and therefore doesn't work on Cori -- but instead we build our own version using the Cray compiler wrappers) without srun, then I get:

MPIR_Init_thread(537):                                                                                                                                                                                      
MPID_Init(246).......: channel initialization failed                                                                                                                                                        
MPID_Init(647).......:  PMI2 init failed: 1

This is a common error when a program calls MPI_Init() without being launched via srun. I haven't had a chance to dig around in the guts of MPIClusterManagers (I plan to). I suspect that MPIClusterManagers is launching the workers using srun, thereby avoiding the srun-less MPI_Init.

Here is a theory that I will try next: would wrapping using Elemental in @mpi_do fix this? (I.e. we want to avoid the situation where anything other than the workers tries to initialize MPI.) Roughly what I have in mind is sketched below.
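
Something like this (untested, and it assumes the front-end can get by without loading Elemental at all):

using MPIClusterManagers, Distributed

man = MPIManager(np = 4)
addprocs(man)          # the manager launches the workers, which can call MPI_Init there

# keep Elemental off the srun-less front-end; load it only on the MPI workers
@mpi_do man begin
    using Elemental
    using LinearAlgebra
end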

@vchuravy
Member

Oh yeah that makes sense. That is rather unfriendly behavior on the Cray side... We still load the Elemental library on the front-end process.

@vchuravy
Member

Maybe we can use the PMI cluster manager I wrote https://github.com/JuliaParallel/PMI.jl/blob/main/examples/distributed.jl

@JBlaschke

Alas @vchuravy Cray MPICH wants PMI2 :(

cori01:~> export LD_LIBRARY_PATH=/usr/lib64/slurmpmi:$LD_LIBRARY_PATH
cori01:~> srun -n 2 -C haswell julia pmi_simp.jl
srun: job 51995644 queued and waiting for resources
srun: job 51995644 has been allocated resources
/global/common/cori_cle7/software/julia/julia/1.6.0/bin/julia: symbol lookup error: /opt/cray/pe/mpt/7.7.14/gni/mpich-intel/16.0/lib/libmpich.so: undefined symbol: PMI2_Init
srun: error: nid00915: task 0: Exited with exit code 127
srun: launch/slurm: _step_signal: Terminating StepId=51995644.0
srun: error: nid00915: task 1: Exited with exit code 1
