Regression in sparse-dense multiplication #52137

Open
dkarrasch opened this issue Nov 12, 2023 · 6 comments
Labels
linear algebra, performance (Must go faster), regression (Regression in behavior compared to a previous version)

Comments

@dkarrasch
Member

dkarrasch commented Nov 12, 2023

Running this piece of code, I get:

```julia
using LinearAlgebra, SparseArrays, BenchmarkTools

function _spmatmul!(C, A, B, α, β)
    size(A, 2) == size(B, 1) || throw(DimensionMismatch())
    size(A, 1) == size(C, 1) || throw(DimensionMismatch())
    size(B, 2) == size(C, 2) || throw(DimensionMismatch())
    nzv = nonzeros(A)
    rv = rowvals(A)
    β != one(β) && LinearAlgebra._rmul_or_fill!(C, β)
    for k in 1:size(C, 2)
        @inbounds for col in 1:size(A, 2)
            αxj = B[col,k] * α
            for j in nzrange(A, col)
                C[rv[j], k] += nzv[j]*αxj
            end
        end
    end
    C
end

n = 1000;
A = sparse(Float64, I, n, n);
b = Vector{Float64}(undef, n);
x = ones(n);
@btime _spmatmul!($b, $A, $x, true, false);
#  1.703 μs (0 allocations: 0 bytes) on v1.9.3
# 2.161 μs (0 allocations: 0 bytes) on v1.10.0-rc1
# 2.945 μs (0 allocations: 0 bytes) on an 11-day-old master
```

Multiplication dispatch has changed over the 1.9-to-1.10 transition, but the above is nothing but the bare-bones multiplication code that we have had "ever since", without character processing and all that. And since it does not call any high-level functions, the issue must lie outside of SparseArrays.jl, AFAIU.

x-ref JuliaSparse/SparseArrays.jl#469

@dkarrasch added the performance, linear algebra, and regression labels on Nov 12, 2023
@aravindh-krishnamoorthy
Contributor

aravindh-krishnamoorthy commented Nov 13, 2023

An initial analysis is below. First, I used PProf to profile the function _spmatmul! on v1.9.3 and on 85d7cca. Then I compared the Intel assembly obtained via @code_native (still underway).

PProf
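
For reference, the listings below can be collected roughly as follows. This is only a sketch: it reuses b, A, and x from the snippet above, and the iteration count is an arbitrary choice.

```julia
# Sketch: profile _spmatmul! with the Profile stdlib and view it with PProf.jl.
# The repetition count is arbitrary; pick enough iterations to get stable
# sample counts for a ~2 μs kernel.
using Profile, PProf

_spmatmul!(b, A, x, true, false)   # warm up so compilation is not sampled
Profile.clear()
@profile for _ in 1:100_000
    _spmatmul!(b, A, x, true, false)
end
pprof()                            # serve the interactive report with the source listing
```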

v1.9.3:
```julia
_spmatmul!(::Vector{Float64}, ::SparseMatrixCSC{Float64, Int64}, ::Vector{Float64}, ::Bool, ::Bool)
/home/ark/exp/52137.jl

  Total:       52298      53841 (flat, cum)  99.86%
      0         1053       1053           <instructions with unknown line numbers>
      1            .          .           using LinearAlgebra, SparseArrays, BenchmarkTools
      2            .          .
      3           46         46           function _spmatmul!(C, A, B, α, β)
      4          115        115               size(A, 2) == size(B, 1) || throw(DimensionMismatch())
      5            7          7               size(A, 1) == size(C, 1) || throw(DimensionMismatch())
      6            .          .               size(B, 2) == size(C, 2) || throw(DimensionMismatch())
      7            .          .               nzv = nonzeros(A)
      8            .          .               rv = rowvals(A)
      9            .       1543               β != one(β) && LinearAlgebra._rmul_or_fill!(C, β)
     10            .          .               for k in 1:size(C, 2)
     11            .          .                   @inbounds for col in 1:size(A, 2)
     12            .          .                       αxj = B[col,k] * α
     13        10181      10181                       for j in nzrange(A, col)
     14        32632      32632                           C[rv[j], k] += nzv[j]*αxj
     15         7388       7388                       end
     16          853        853                   end
     17            .          .               end
     18           23         23               C
     19            .          .           end
     20            .          .
```

85d7cca:
```julia
_spmatmul!(::Vector{Float64}, ::SparseMatrixCSC{Float64, Int64}, ::Vector{Float64}, ::Bool, ::Bool)
/home/ark/exp/52137.jl

  Total:       76441      77940 (flat, cum)   100%
      0         8239       8239           <instructions with unknown line numbers> 
      1            .          .           using LinearAlgebra, SparseArrays, BenchmarkTools 
      2            .          .            
      3          135        135           function _spmatmul!(C, A, B, α, β) 
      4           77         77               size(A, 2) == size(B, 1) || throw(DimensionMismatch()) 
      5            .          .               size(A, 1) == size(C, 1) || throw(DimensionMismatch()) 
      6            .          .               size(B, 2) == size(C, 2) || throw(DimensionMismatch()) 
      7            .          .               nzv = nonzeros(A) 
      8            .          .               rv = rowvals(A) 
      9           15       1514               β != one(β) && LinearAlgebra._rmul_or_fill!(C, β) 
     10            .          .               for k in 1:size(C, 2) 
     11            2          2                   @inbounds for col in 1:size(A, 2) 
     12            .          .                       αxj = B[col,k] * α 
     13        10453      10453                       for j in nzrange(A, col) 
     14        34117      34117                           C[rv[j], k] += nzv[j]*αxj 
     15        22692      22692                       end 
     16          690        690                   end 
     17            .          .               end 
     18           21         21               C 
     19            .          .           end 
     20            .          .
```

Observations

  1. Lines 13 (nzrange) and 14 (the multiply-accumulate) have roughly the same cost in both versions.
  2. Line 15 (end) is much more expensive on 85d7cca (22692 vs. 7388 samples).

Assembly

The assembly from @code_native (Intel syntax), obtained on my Intel PC, is attached for both cases: 1.9.3.asm.txt and 85d7cca.asm.txt.
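
For reference, a dump like this can be produced roughly as follows; the output file name is a placeholder, and the tuple signature matches the benchmark call above.

```julia
# Write the native code for the benchmarked method instance to a file,
# using Intel syntax.
using InteractiveUtils, SparseArrays

open("spmatmul.asm.txt", "w") do io
    code_native(io, _spmatmul!,
                Tuple{Vector{Float64}, SparseMatrixCSC{Float64, Int64},
                      Vector{Float64}, Bool, Bool};
                syntax = :intel)
end
```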

Observations

  1. 85d7cca produces much longer assembly, apparently due to loop unrolling; the file is much longer than the one for v1.9.3.
  2. In both versions, the body of the inner loop C[rv[j], k] += nzv[j]*αxj compiles to essentially the same code (4 instructions, using SSE xmm registers).
  3. My best guess as of today (after looking at the asm files) is that the new code in 85d7cca attempts a 4x unroll of the inner loop on line 14 above, but ends up causing register spills; the cycles counted on line 15 seem to come from those spills and reloads. (I also see that some iterator code has changed, but I don't think that alone should cause such a huge difference in performance.)

More later!

@KristofferC
Member

Could maybe be an LLVM upgrade where the vectorizer now does a worse job?
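For completeness, a trivial first check is whether the bundled LLVM version actually changed between the two builds:

```julia
# Print the LLVM version shipped with the running Julia build.
using InteractiveUtils
versioninfo()            # includes an "LLVM: libLLVM-x.y.z" line
Base.libllvm_version     # the same information as a VersionNumber
```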

@aravindh-krishnamoorthy
Contributor

aravindh-krishnamoorthy commented Nov 14, 2023

LLVM Code

Comparing the LLVM code, generated using the command @code_llvm _spmatmul!(b, A, x, true, false), we have the following observations. The files are here: 1.9.3.llvm.txt and 85d7cca.llvm.txt; a sketch of how such dumps can be produced follows the observations.

  1. The LLVM code generated for 85d7cca is longer than that for 1.9.3.
  2. The 4x loop unrolling observed in the native code above is already present in the LLVM code.
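
For reference, the IR for the benchmarked method instance can be dumped to a file on each version roughly as follows; the file name and the debuginfo setting are my own choices.

```julia
# Dump the optimized LLVM IR for the method instance used in the benchmark.
using InteractiveUtils, SparseArrays

open("spmatmul_$(VERSION).llvm.txt", "w") do io
    code_llvm(io, _spmatmul!,
              Tuple{Vector{Float64}, SparseMatrixCSC{Float64, Int64},
                    Vector{Float64}, Bool, Bool};
              debuginfo = :none)   # drop line-number metadata to make diffs cleaner
end
```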

Hence, at this point, my best guess is that the 4x unrolling in the LLVM code causes register spills and reloads (2x 64 bits each) when lowered for my Intel CPU (i7-1165G7), leading to the increased cycles observed on line 15 in the post above.

Any suggestions on how to confirm this guess are welcome!

NB: It could also be that the generated native code itself is suboptimal. But based on my initial reading, apart from minor differences and the 4x unrolling, the code structure is quite similar between the two versions.
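
A rough way to probe the spill hypothesis directly from the attached listings (only a sketch; the regex targets Intel-syntax stack references and also counts ordinary prologue traffic):

```julia
# Count memory accesses through stack slots ([rsp ...] / [rbp ...]) in the
# attached Intel-syntax listings; a large difference would be consistent
# with extra spill/reload traffic in the unrolled loop.
for file in ("1.9.3.asm.txt", "85d7cca.asm.txt")
    asm = read(file, String)
    stack_refs = length(collect(eachmatch(r"\[r[sb]p[^\]]*\]", asm)))
    println(rpad(file, 20), " stack-slot references: ", stack_refs)
end
```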

@aravindh-krishnamoorthy
Contributor

See also #52429

@ViralBShah
Member

ViralBShah commented May 30, 2024

On M-series Macs, I see 1.1 μs on 1.9 and 1.10, and 1.25 μs on 1.11-rc1 and 1.12-dev.

On x64, the gap is larger: 1.2 μs on 1.9 and 1.5 μs on 1.11-rc1 and 1.12-dev.

@KristofferC
Member

KristofferC commented May 30, 2024

I bisected this to dbd82a4 (#49747).

It's a bit funny, because the PR's benchmarks show sparse matmul getting the biggest improvement from this change.
Also, the PR seems to have only added optimization passes, so perhaps LLVM is being dumb and one of those passes makes things worse here.

cc @gbaraldi, @pchintalapudi

Edit: I also noticed that the benchmark uses a diagonal sparse matrix, which isn't very representative: each nzrange(A, col) has only a single element, so any optimization that assumes more than one element is wasted.
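
A somewhat more representative follow-up benchmark might look like the sketch below; the density is an arbitrary choice aiming for several stored entries per column.

```julia
# Re-run the benchmark with a random sparse matrix instead of the identity,
# so each nzrange(A, col) has several elements on average.
using SparseArrays, BenchmarkTools

n = 1000
A = sprand(n, n, 10 / n)            # ≈10 nonzeros per column on average
x = ones(n)
b = Vector{Float64}(undef, n)
@btime _spmatmul!($b, $A, $x, true, false)
```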
