mul! performance regression for sparse matrices in v1.11.0-DEV.1035 #52429

Closed
dmbates opened this issue Dec 6, 2023 · 8 comments · Fixed by JuliaSparse/SparseArrays.jl#486
Labels
linear algebra Linear algebra performance Must go faster regression Regression in behavior compared to a previous version sparse Sparse arrays

Comments

@dmbates
Member

dmbates commented Dec 6, 2023

We have noticed that one of the tests in MixedModels.jl has had a dramatic increase in execution time in recent DEV versions (locally compiled on a Mac M1, but the same performance regression is seen on Intel-based computers). The test is called grouseticks.

A big part of the problem seems to be a performance regression in mul! when C and A are sparse matrices and B is the transpose of a sparse matrix. An MWE is

using BenchmarkTools, LinearAlgebra, SparseArrays

Is = repeat(1:10; inner=200);
Js = repeat(1:100; inner=20);
Ks = repeat(1:1000; inner=2);
V = ones(2000);
A = sparse(Is, Ks, V);
B = sparse(Js, Ks, V);
C = sparse(Is, Js, V);

@benchmark mul!(c, a, b', -0.001, 1.0) seconds=1 setup=(a=A; b=B; c=C)

Under 1.10.0-rc2 the benchmark on my M1 laptop runs in around 0.5 ms; under 1.11.0-DEV.1035 it takes about 8 ms.
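For context, the five-argument mul!(C, A, B, α, β) updates C in place as C = α*A*B + β*C. A tiny sketch checking the semantics of the benchmarked call (small matrices for illustration, not the benchmark itself):

```julia
using LinearAlgebra, SparseArrays

a = sparse([1.0 0.0; 0.0 2.0])
b = sparse([1.0 1.0; 0.0 1.0])
c = sparse([1.0 0.0; 0.0 1.0])

# compute the reference result before mutating c
expected = -0.001 .* (a * b') .+ 1.0 .* c

# in-place: c = -0.001 * a * b' + 1.0 * c
mul!(c, a, b', -0.001, 1.0)
@assert c ≈ expected
```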

@giordano
Contributor

giordano commented Dec 6, 2023

Probably due to the excision of SparseArrays from the sysimage? But that should affect only compile time; if this is reproducible after compilation, it is likely something else.

@giordano giordano added performance Must go faster regression Regression in behavior compared to a previous version sparse Sparse arrays labels Dec 6, 2023
@dmbates
Member Author

dmbates commented Dec 7, 2023

I believe it is due to a change of algorithm in the function _generic_matmatmul! defined in LinearAlgebra/src/matmul.jl. When profiling a similar calculation under 1.11.0-DEV, the majority of the time is spent in that function, far more than under 1.10.0-rc2 and earlier versions.
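A hedged sketch of how one might confirm where the time goes: run the MWE under Profile.jl and look for LinearAlgebra._generic_matmatmul! frames in the flat report.

```julia
using Profile, LinearAlgebra, SparseArrays

Is = repeat(1:10; inner=200)
Js = repeat(1:100; inner=20)
Ks = repeat(1:1000; inner=2)
V = ones(2000)
A = sparse(Is, Ks, V)
B = sparse(Js, Ks, V)
C = sparse(Is, Js, V)

mul!(C, A, B', -0.001, 1.0)  # warm up so compilation is excluded

Profile.clear()
@profile for _ in 1:100
    mul!(C, A, B', -0.001, 1.0)
end
# flat view sorted by sample count; _generic_matmatmul! should dominate
Profile.print(format=:flat, sortedby=:count)
```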

@giordano giordano added the linear algebra Linear algebra label Dec 7, 2023
@aravindh-krishnamoorthy
Contributor

See also #52137

@dkarrasch
Member

The major refactoring (for significantly reduced load times of SparseArrays.jl) happened in v1.10; I'm not even sure we've had any major merged PRs in SparseArrays.jl in the v1.11 cycle. In that refactoring, the underlying multiplication algorithms remained untouched; only the dispatch might have changed, and if it did, that was clearly not the intention.

@dkarrasch
Member

dkarrasch commented Dec 8, 2023

Ok, I suspect it's 0cf2bf1 (#52038) that led to the increase in runtime. The reason is that sparse output for sparse-sparse multiplication has always been handled by the most generic LinearAlgebra._generic_matmatmul!; specialized multiplication kernels exist only for strided output.

julia> @which LinearAlgebra.generic_matmatmul!(c, 'N', 'T', a, b, LinearAlgebra.MulAddMul(-0.001, 1.0))
generic_matmatmul!(C::AbstractVecOrMat, tA, tB, A::AbstractVecOrMat, B::AbstractVecOrMat, _add::LinearAlgebra.MulAddMul)
     @ LinearAlgebra /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/LinearAlgebra/src/matmul.jl:767

If that's the case, then this is different from #52137, where the exact same multiplication kernel has significantly reduced performance.

EDIT: Well, that's exactly #52429 (comment).

@dkarrasch
Member

Furthermore, I suspect it has to do with the tiling algorithm. If I understand that one correctly, it takes tiles of the arrays and copies them to temporary (dense) arrays; the intermediate steps are then much faster than accessing sparse arrays element by element. Maybe we need to bring back that tiling-based algorithm and restrict the current ones to strided arrays with generic/mixed eltypes? For those cases, it's been pretty much optimized. @chriselrod
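A minimal sketch of the tiling idea described above, assuming I have it right (this is a hypothetical simplification, not the algorithm removed in #52038, and tiled_matmul is a made-up name): each tile of A and B is densified once, so the hot loops run on dense buffers instead of repeated sparse getindex calls.

```julia
using LinearAlgebra, SparseArrays

# Hypothetical sketch of tile-and-densify matmul: copy ts×ts tiles of A and B
# into dense buffers, then accumulate with the dense kernel. Returns a dense C
# for simplicity.
function tiled_matmul(A::AbstractMatrix, B::AbstractMatrix; ts::Int=64)
    m, k = size(A)
    k == size(B, 1) || throw(DimensionMismatch("inner dimensions differ"))
    n = size(B, 2)
    C = zeros(promote_type(eltype(A), eltype(B)), m, n)
    for l0 in 1:ts:k, j0 in 1:ts:n, i0 in 1:ts:m
        i1 = min(i0 + ts - 1, m)
        j1 = min(j0 + ts - 1, n)
        l1 = min(l0 + ts - 1, k)
        Atile = Matrix(@view A[i0:i1, l0:l1])  # densify the tile once
        Btile = Matrix(@view B[l0:l1, j0:j1])
        # C[i0:i1, j0:j1] += Atile * Btile, using the dense 5-arg mul! kernel
        mul!(view(C, i0:i1, j0:j1), Atile, Btile, true, true)
    end
    return C
end

A = sprand(200, 150, 0.05)
B = sprand(150, 120, 0.05)
@assert tiled_matmul(A, B) ≈ Matrix(A) * Matrix(B)
```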

@palday
Contributor

palday commented Dec 8, 2023

I bisected this to find where the minimum time for the example @dmbates provided exceeded 5 ms on my MacBook. (I get similar times to @dmbates on my laptop: minimum time is ~0.5 ms on 1.10 and >>5 ms on nightly.)

0cf2bf143552c2f425150fcb2a60225c67ea042d is the first bad commit
commit 0cf2bf143552c2f425150fcb2a60225c67ea042d
Author: Daniel Karrasch <[email protected]>
Date:   Tue Nov 14 22:58:03 2023 +0100

    Reduce compile time for generic matmatmul (#52038)
    
    This is another attempt at improving the compile time issue with generic
    matmatmul, hopefully improving runtime performance also.
    
    @chriselrod @jishnub
    
    There seems to be a little typo/oversight somewhere, but it shows how it
    could work. Locally, this reduces benchmark times from
    https://github.com/JuliaLang/julia/pull/51812#issuecomment-1780394475 by
    more than 50%.
    
    ---------
    
    Co-authored-by: Chris Elrod <[email protected]>

@dkarrasch
Member

Perhaps we should move the tiling-based algorithm to SparseArrays.jl? It performed worse than what we have now for strided arrays with generic and/or mixed eltypes, so its real benefit seems to be in locally densifying sparse arrays.
