Equivalent of cuSparse - start with sparse matvec #208
Comments
Metal's performance shaders library does not support sparse arrays. Apple Accelerate does, but that's for the CPU. Maybe that's good enough, though (with the memory being unified anyway)? A generic implementation would be nice, but I don't have much experience with sparse algorithms. What operations would be important? |
There is a
|
What makes |
I guess I am trying to figure out what the right programming model would be here. Getting a fast sparse matvec (and getting Conjugate Gradient working), followed by a fast matmul, would be a good starting point to explore what is possible. I'll experiment with a few things and see how far I can get. |
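To make the dependency between the two goals concrete, here is a minimal sketch of Conjugate Gradient written against an arbitrary `matvec` callback. It is plain, serial Python rather than Julia or a GPU kernel, and the names `coo_matvec` and `conjugate_gradient` are hypothetical helpers introduced only for illustration. It assumes the matrix is symmetric positive definite and stores it as a list of `(row, col, value)` triplets.

```python
def coo_matvec(entries, n_rows, x):
    """y = A @ x, with A given as (row, col, value) triplets (hypothetical helper)."""
    y = [0.0] * n_rows
    for row, col, value in entries:
        y[row] += value * x[col]
    return y


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A, using only matvec(v) = A @ v."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x, with x = 0
    p = list(r)                      # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:      # residual norm small enough
            break
        Ap = matvec(p)               # the only place the matrix is touched
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Small SPD example: A = [[4, 1], [1, 3]], b = [1, 2]
entries = [(0, 0, 4.0), (0, 1, 1.0), (1, 0, 1.0), (1, 1, 3.0)]
solution = conjugate_gradient(lambda v: coo_matvec(entries, 2, v), [1.0, 2.0])
```

The solver touches the matrix only through `matvec`, so everything else is dense vector arithmetic; that is why a fast sparse matvec is the natural first building block.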
There are some native kernels I wrote in CUDA.jl, https://github.com/JuliaGPU/CUDA.jl/blob/master/lib/cusparse/broadcast.jl, which use row/column iterators that 'zip' the multiple inputs. Thus, they parallelize across one dimension of the sparse input. Multiplication is much more difficult though, as there isn't a straightforward dimension to accelerate over. (The crux of the issue is that we cannot have an efficient |
Also note that those CUDA.jl kernels are ideally suited to be ported to GPUArrays using KA.jl, once we start doing that, as they don't use any advanced CUDA features. |
From After that, a sparse matmul is valuable. Since CUDA uses CSR, perhaps we could just use that for Metal.jl as well. |
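As a rough illustration of why CSR is a convenient starting format for matvec, the sketch below computes `y = A * x` row by row. It is serial Python with 0-based indices, not Metal.jl or CUDA.jl code, and the names `csr_matvec`, `indptr`, `indices`, and `data` are assumptions chosen for this example. Each row reads its own slice of the nonzeros and writes exactly one output entry, so on a GPU the outer loop could in principle be mapped to one thread per row.

```python
def csr_matvec(indptr, indices, data, x):
    """y = A @ x for A in CSR form (0-based indices).

    indptr[row]:indptr[row + 1] is the slice of `indices`/`data`
    holding the column indices and values of that row.
    """
    n_rows = len(indptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):            # independent per row: candidate for one thread each
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc                     # single write per row, no conflicts
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
indptr = [0, 2, 3, 5]
indices = [0, 2, 1, 0, 2]
data = [1.0, 2.0, 3.0, 4.0, 5.0]
y = csr_matvec(indptr, indices, data, [1.0, 1.0, 1.0])
```

A one-thread-per-row mapping is only the simplest option; rows with very different numbers of nonzeros would leave it poorly load balanced.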
I am a bit lost on how to start working on such an implementation. Shall I use the kernel programming capabilities of |
Implementing this in GPUArrays.jl using KernelAbstractions.jl (starting from JuliaGPU/GPUArrays.jl#525) is probably the best thing. One complication is where and how to define the concrete sparse array types that will be needed for this; I guess we will need some new type hierarchy for host and device sparse arrays backed by GPU memory that packages like CUDA.jl can then provide concrete implementations for (but with well-defined interfaces such that generic kernels defined in GPUArrays.jl can operate on them). |
After diving a bit into approaches to parallelizing sparse matrix-vector multiplication (spmv) and sparse matrix-matrix multiplication (spmm), I realized that this is a far more complicated (and richer) field than I thought last week. Julia's standard format for sparse matrices is CSC. Naively parallelizing the serial mv multiplication algorithm
results in a race condition due to simultaneous writes into
Of course, this is not optimal as the accesses to I tried to check what NVIDIA does with the CSC format. I do not have direct access to an NVIDIA device currently, so I tried Google Colab. I could not find any significant runtime difference between CSC and CSR spmv (see this gist). So apparently, NVIDIA has found a good solution for spmv with the CSC format. |
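The race condition can be illustrated with a small serial sketch, assuming the contested writes go to the output vector (called `y` here, which is an assumption). The code is plain Python with 0-based indices. The array names `colptr`, `rowval`, and `nzval` mirror the fields of Julia's `SparseMatrixCSC`, while `csc_matvec` and `rows_written_by_several_columns` are hypothetical helpers. In CSC the natural loop runs over columns, and each column scatters into `y`; if columns were handed to separate threads, any row with nonzeros in more than one column would be updated concurrently.

```python
from collections import Counter


def csc_matvec(colptr, rowval, nzval, x, n_rows):
    """y = A @ x for A in CSC form (0-based indices), serial reference version."""
    y = [0.0] * n_rows
    for col in range(len(colptr) - 1):
        xj = x[col]
        for k in range(colptr[col], colptr[col + 1]):
            # Scatter: different columns may hit the same y[rowval[k]].
            y[rowval[k]] += nzval[k] * xj
    return y


def rows_written_by_several_columns(rowval):
    """Rows that a column-parallel loop would update from more than one thread."""
    counts = Counter(rowval)
    return sorted(row for row, n in counts.items() if n > 1)


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
colptr = [0, 2, 3, 5]
rowval = [0, 2, 1, 0, 2]
nzval = [1.0, 4.0, 3.0, 2.0, 5.0]
y = csc_matvec(colptr, rowval, nzval, [1.0, 1.0, 1.0], 3)
conflicts = rows_written_by_several_columns(rowval)
```

Here rows 0 and 2 each receive contributions from two columns, so a column-parallel version would need atomic additions or some other way of serializing those updates. This sketch does not show how cuSPARSE handles the CSC case.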
I am following this issue. It would be very useful to have support for CSC and CSR matrices, with at least basic operations like + and * between matrices, and matrix-vector multiplication. |
I believe there is currently no sparse matrix capability in Metal.jl. What is the easiest way to get some basic things working?
Perhaps a bigger question is whether we can have a generic sparse matrix implementation that can work on all our GPU backends.