How do I use svdvals in parallel? #75
Comments
Hi @dhiepler -- I also saw your trouble ticket at NERSC; let me respond here so that the community can weigh in. Without having tried it myself (I am currently traveling with bad internet -- and alas my burner laptop doesn't have Julia), here are some ideas:
I will check these myself soon, but maybe @andreasnoack has some ideas in the meantime. Cheers, |
I have more information, thanks also to @dhiepler; here is an expanded test program:
using Distributed
using Elemental
using DistributedArrays
using LinearAlgebra
@everywhere using Random
@everywhere Random.seed!(123)
A = drandn(10,10)
Al = Matrix(A)
println("sum of diffs= ",sum(A-Al))
println("Determinant of singular matrix is 0")
#println("det(A)= ", det(A)) # not implemented for DArray
println("det(Al)= ", det(Al))
a = svdvals(Al)
b = Elemental.svdvals(A)
println("a= ", a)
println("b= ", b)
println("a-b= ", a-b)
println("a./b= ", a./b)
#println("det(a) = ", det(a))
println("sum of ratios after svd = ",sum(a./b))
println("sum of diffs after svd = ",sum(a-b))
println("approx= ", a ≈ b) What I found was rather startling: it works fine on my laptop with more than one MPI rank. When I run it using MPI on Cori, then I get a constant factor difference in each element: One rank:
Two ranks:
Four ranks:
My crude guess would be that there is a factor difference of |
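One way to check the constant-factor hypothesis, using the a and b vectors already computed by the script above (a minimal sketch; Statistics is a stdlib and only used for mean):
using Statistics
ratios = vec(Array(a ./ b))   # collect to a plain Vector, whatever a ./ b returns
# If the discrepancy really is a single constant factor, every ratio should agree:
println("mean ratio       = ", mean(ratios))
println("ratio spread     = ", maximum(ratios) - minimum(ratios))
println("constant factor? ", all(r -> isapprox(r, ratios[1]; rtol=1e-8), ratios))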
I can reproduce the bug on my laptop. My guess is that this is an issue with the conversion between Elemental arrays and DistributedArrays.DArray. If I just use Elemental's own distributed matrix type directly, it works:
using Elemental
using LinearAlgebra
A = Elemental.DistMatrix(Float64)
Elemental.gaussian!(A, 10, 10)
Al = Array(A)
println("sum of diffs= ", sum(A - Al))
a = svdvals(Al)
b = svdvals(A)
println("a = ", a)
println("b = ", b)
println("a - b= ", a - b)
println("a ./ b= ", a ./ b)
@info "DONE!" |
Thanks @andreasnoack -- I just verified that this also works on Cori. This leaves an open question, given what @dhiepler is trying to do with the original code: how do you create an Elemental array from an existing array (if not from a DArray)? |
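One possible answer, sketched with heavy caveats: Elemental's C API fills a DistMatrix through reserve/queueUpdate/processQueues calls, and Elemental.jl appears to wrap these, but the exact Julia names, the indexing convention (0- or 1-based), and the existence of a zeros!-style allocation helper are assumptions that should be checked against the package source before relying on this:
using Elemental, LinearAlgebra
Al = randn(10, 10)                                # the existing local array
A  = Elemental.DistMatrix(Float64)
Elemental.zeros!(A, size(Al, 1), size(Al, 2))     # assumed allocation helper
for j in 1:size(Al, 2), i in 1:size(Al, 1)
    Elemental.queueUpdate(A, i, j, Al[i, j])      # assumed wrapper of ElDistMatrixQueueUpdate
end
Elemental.processQueues(A)                        # assumed wrapper; flushes the queued updates collectively
b = svdvals(A)                                    # then use the DistMatrix as in the comment above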
Running #77 resulted for me in:
Meaning that the wrong MPI library was being used. You can check yourself with:
After switching to MPICH_jll:
I got:
When JuliaParallel/MPI.jl#513 lands, we will be able to fix issues like this. Now:
yields |
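For readers hitting the same wall, one way to see which MPI library MPI.jl is actually using is sketched below. The switch to a system MPI is version-dependent (older MPI.jl releases used the JULIA_MPI_BINARY environment variable plus Pkg.build("MPI"); newer releases use MPIPreferences), so treat the commented part as an assumption about the MPI.jl version in use:
using MPI
println(MPI.Get_library_version())      # full version string of the underlying MPI library
println(MPI.identify_implementation())  # which implementation MPI.jl detected
# For recent MPI.jl, selecting the system MPI (e.g. cray-mpich) would be:
# using MPIPreferences
# MPIPreferences.use_system_binary()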
Thanks @vchuravy for looking into this also. Unfortunately it doesn't work on Cori. Is the MPIClusterManager necessary? Taking out the MPIClusterManager parts, I am using the system MPICH (i.e. cray-mpich). |
Yeah, the MPIClusterManager is necessary for the variant of the code that uses DistributedArrays; otherwise the ranks are not wired up correctly. |
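For context, the wiring the MPIClusterManager provides looks roughly like this in its simplest mode, where the manager itself spawns the MPI ranks as Distributed workers (a sketch for a plain mpiexec-capable machine, not the Cori/srun setup discussed below):
using MPIClusterManagers, Distributed
manager = MPIManager(np = 2)      # spawns 2 MPI ranks via mpiexec and wires them up as workers
addprocs(manager)
@everywhere using DistributedArrays
A = drandn(10, 10)                # the DArray pieces now live on properly wired-up MPI ranks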
Oh Feck! So now I have to fix |
Please do :) |
Done! |
@vchuravy Based on our discussion in JuliaParallel/MPIClusterManagers.jl#26, your solution was to run the ... But then using Elemental throws an error when trying to initialize Elemental.jl ... we appear to have a catch-22 :( |
And if I avoid the ... and instead run:
using MPIClusterManagers, Distributed
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)
using Elemental
using LinearAlgebra
using DistributedArrays
A = drandn(50,50)
Al = Matrix(A)
a = svdvals(Al)
b = Array(Elemental.svdvals(A))
println("sum of diffs= ",sum(a-b))
MPIClusterManagers.stop_main_loop(manager)
I get a deadlock. |
Without srun, but with salloc? |
Elemental needs |
I am confused. MPIClusterManager should use |
I don't know all the details at this point, but if I run Elemental (remember, we don't use the one provided by BinaryBuilder -- that one doesn't use Cray MPICH, and therefore doesn't work on Cori -- but instead we build our own version using the Cray compiler wrappers) without srun, then I get:
This is a common error when a program calls MPI_Init outside of an srun allocation. Here is a theory that I will try next: would wrapping ... |
Oh yeah that makes sense. That is rather unfriendly behavior on the Cray side... We still load the Elemental library on the front-end process. |
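One possible reading of that theory, written down only as a guess: keep the front-end process free of Elemental entirely and import it just on the MPI-launched workers, so that nothing ever calls MPI_Init outside of srun. Whether Elemental.jl tolerates being loaded on only a subset of processes is itself an assumption:
using Distributed
# assumes workers have already been added by a cluster manager that places
# them inside the srun allocation, while the front-end process stays outside
@everywhere workers() using Elemental           # the front-end never loads Elemental
@everywhere workers() using DistributedArrays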
Maybe we can use the PMI cluster manager I wrote https://github.com/JuliaParallel/PMI.jl/blob/main/examples/distributed.jl |
Alas @vchuravy Cray MPICH wants PMI2 :( |
My goal is to add svd(A::DArray) to the Julia interface to Elemental. I have been looking at svdvals to understand how to do it, but I do not know how to use svdvals in parallel. The following program has a sum of diffs close to 0 when using a single process, but when I run it with julia -p 2 or mpiexecjl -n 2 the sum of diffs is large. How do I convert the program to run in parallel?