
How to go from files to distributed matrix. #74

Open
dh-ilight opened this issue Oct 11, 2021 · 3 comments

I have files, each holding one column of an array, and I would like to create an Elemental.DistMatrix from these files, loading it in parallel. An earlier question was answered by pointing to Elemental/test/lav.jl, so I made the following program by extracting from lav.jl. It works on a single node but hangs with 2 nodes when launched via mpiexecjl. I am using Julia 1.5 on a 4-core machine running CentOS 7.5. Please let me know what is wrong with the program and how to load my column files in parallel; I intend to eventually run a program using DistMatrix on a computer with hundreds of cores.

# to import MPIManager
using MPIClusterManagers, Distributed

# Manage MPIManager manually -- all MPI ranks do the same work
# Start MPIManager
manager = MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL)

# Init an Elemental.DistMatrix
@everywhere function spread(n0, n1)
    println("start spread")
    height = n0 * n1
    width = n0 * n1
    h = El.Dist(n0)
    w = El.Dist(n1)
    A = El.DistMatrix(Float64)
    El.gaussian!(A, n0, n1)  # how to init size?
    localHeight = El.localHeight(A)
    println("localHeight ", localHeight)
    El.reserve(A, 6 * localHeight)  # number of queue entries
    println("after reserve")
    for sLoc in 1:localHeight
        s = El.globalRow(A, sLoc)
        x0 = ((s - 1) % n0) + 1
        x1 = div(s - 1, n0) + 1
        El.queueUpdate(A, s, s, 11.0)
        println("sLoc $sLoc, x0 $x0")
        if x0 > 1
            El.queueUpdate(A, s, s - 1, -10.0)
            println("after q")
        end
        if x0 < n0
            El.queueUpdate(A, s, s + 1, 20.0)
        end
        if x1 > 1
            El.queueUpdate(A, s, s - n0, -30.0)
        end
        if x1 < n1
            El.queueUpdate(A, s, s + n0, 40.0)
        end
        # The dense last column
        # El.queueUpdate(A, s, width, floor(-10/height))
    end # for
    println("before processQueues")
    El.processQueues(A)
    println("after processQueues")  # with 2 nodes never gets here
    return A
end

@mpi_do manager begin
    using MPI, LinearAlgebra, Elemental
    const El = Elemental
    res = spread(4, 4)
    println("res = ", res)

    # Manage MPIManager manually:
    # Elemental needs to be finalized before shutting down MPIManager
    # println("[rank $(MPI.Comm_rank(comm))]: Finalizing Elemental")
    Elemental.Finalize()
    # println("[rank $(MPI.Comm_rank(comm))]: Done finalizing Elemental")
end # mpi_do

# Shut down MPIManager
MPIClusterManagers.stop_main_loop(manager)
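
For reference, here is an untested sketch of the column-file loader I am ultimately aiming for, meant to be called from inside the @mpi_do block above (so `using MPI` and `const El = Elemental` are already in effect). The "col_j.txt" file naming and the El.zeros! initializer are placeholders/assumptions on my part, and the round-robin split of columns across ranks is just one possible choice:

# Untested sketch: file "col_j.txt" (placeholder name) holds the `height` values of
# column j, one value per line. Each rank reads a round-robin subset of the columns
# and queues those entries; processQueues then routes them to the owning ranks.
using DelimitedFiles

function load_columns(dir, height, width)
    myrank = MPI.Comm_rank(MPI.COMM_WORLD)
    nranks = MPI.Comm_size(MPI.COMM_WORLD)
    A = El.DistMatrix(Float64)
    El.zeros!(A, height, width)             # assumed initializer: size A and zero it,
                                            # since queueUpdate adds to existing entries
    mycols = (myrank + 1):nranks:width      # this rank's share of the column files
    El.reserve(A, height * length(mycols))  # number of entries this rank will queue
    for j in mycols
        col = vec(readdlm(joinpath(dir, "col_$j.txt")))
        for i in 1:height
            El.queueUpdate(A, i, j, col[i]) # global (i, j) entry taken from file j
        end
    end
    El.processQueues(A)                     # collective: every rank must call this
    return A
end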

Thank you

@JBlaschke commented Oct 11, 2021

This is based on a NERSC user ticket, which also inspired #73. cc @andreasnoack

@dhiepler can you put the code snippet in a code block (put ```julia at the beginning and ``` at the end)?

@andreasnoack (Member) commented:

The program looks right to me. To debug this, I'd try to remove the MPIClusterManagers/Distributed parts and then run the script directly with mpiexec, like we do in the test suite:

run(`$exec -np $nprocs $(Base.julia_cmd()) $(joinpath(@__DIR__, f))`)
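
Concretely, the stripped-down script could look something like this sketch (untested; it keeps only the Elemental calls from the program above and relies on Elemental setting up MPI when it is loaded, as the original does). Launch it with something like mpiexec -np 2 julia spread_mpi.jl, where the file name is a placeholder:

# Untested sketch of the suggested simplification: no MPIClusterManagers/Distributed,
# every MPI rank simply runs this file.
using MPI, LinearAlgebra, Elemental
const El = Elemental

A = El.DistMatrix(Float64)
El.gaussian!(A, 4, 4)
localHeight = El.localHeight(A)
El.reserve(A, 6 * localHeight)
for sLoc in 1:localHeight
    s = El.globalRow(A, sLoc)
    El.queueUpdate(A, s, s, 11.0)   # same queueing pattern as in the issue, trimmed
end
El.processQueues(A)
println("rank ", MPI.Comm_rank(MPI.COMM_WORLD), ": processQueues finished")
# (the explicit Elemental.Finalize() in the original was only needed because the
#  MPIManager had to be shut down afterwards)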

@JBlaschke commented:
FTR @dhiepler, on Cori that would be:

srun -n $NUM_RANKS julia path/to/test.jl
