
2x faster tuple sorting (for 2-Tuples of Integers) #48724

Closed
LSchwerdt opened this issue Feb 19, 2023 · 0 comments · Fixed by #48753
Labels: performance (Must go faster), sorting (Put things in order)

LSchwerdt commented Feb 19, 2023

TL;DR: `isless` for 2-tuples of integers is significantly faster when implemented by packing both values into a single wider integer.

While experimenting with ways to improve the performance of sortperm, I noticed that sorting tuples of UInt64s is much slower than sorting UInt128s. But a tuple of two UInt64s can be packed into a single UInt128:

pack(x::Tuple{UInt64, UInt64}) = UInt128(x[1]) << 64 | x[2]

Sorting 10^7 tuples using by=pack is almost twice as fast on my machine compared to the default.
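A minimal, self-contained sketch of that comparison (stdlib only; `10^6` elements here instead of the issue's `10^7` to keep the sketch quick, and `@time` instead of BenchmarkTools, so the numbers are rough):

```julia
# The pack trick restated: the first tuple element goes into the high bits,
# so comparing the packed UInt128s is equivalent to lexicographic tuple order.
pack(x::Tuple{UInt64, UInt64}) = UInt128(x[1]) << 64 | x[2]

v = collect(zip(rand(UInt64, 10^6), rand(UInt64, 10^6)))
@assert sort(v; by = pack) == sort(v)  # packed order matches the default isless

@time sort(v);             # default tuple comparison
@time sort(v; by = pack);  # packed comparison, roughly 2x faster per the issue
```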

This can also be seen in the assembly code of the default isless and a modified version with packed integers. Not only is the packed version shorter, it is also branchless:

Assembly for default `isless`
# %bb.0:                                # %top
      pushq   %rbp
      .cfi_def_cfa_offset 16
      .cfi_offset %rbp, -16
      movq    %rsp, %rbp
      .cfi_def_cfa_register %rbp
; │ @ tuple.jl:541 within `isless` @ operators.jl:421
; │┌ @ int.jl:487 within `<`
      movq    (%rcx), %r8
      movq    (%rdx), %r9
      movb    $1, %al
      cmpq    %r9, %r8
; │└
; │ @ tuple.jl:541 within `isless`
      jb      .LBB0_4
# %bb.1:                                # %L6
      jne     .LBB0_2
# %bb.3:                                # %L8
; │ @ tuple.jl:541 within `isless` @ tuple.jl:541 @ operators.jl:421
; │┌ @ int.jl:487 within `<`
      movq    8(%rcx), %rax
      cmpq    8(%rdx), %rax
      setb    %al
.LBB0_4:                                # %common.ret
; │└
; │ @ tuple.jl:541 within `isless`
                                      # kill: def $al killed $al killed $eax       
      popq    %rbp
      retq
.LBB0_2:
      xorl    %eax, %eax
                                      # kill: def $al killed $al killed $eax       
      popq    %rbp
      retq
.Lfunc_end0:
      .size   julia_isless_1264, .Lfunc_end0-julia_isless_1264
      .cfi_endproc
; └
                                      # -- End function
Assembly for packed `isless`
# %bb.0:                                # %top
      pushq   %rbp
      .cfi_def_cfa_offset 16
      .cfi_offset %rbp, -16
      movq    %rsp, %rbp
      .cfi_def_cfa_register %rbp
; │┌ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:17 within `pack`
; ││┌ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:5 within `shiftwiden`      
; │││┌ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:5 within `macro expansion`
; ││││┌ @ operators.jl:882 within `widen`
; │││││┌ @ number.jl:7 within `convert`
; ││││││┌ @ boot.jl:790 within `UInt128`
; │││││││┌ @ boot.jl:775 within `toUInt128`
      movq    (%rcx), %rax
; ││└└└└└└
; ││┌ @ int.jl:1040 within `|`
; │││┌ @ int.jl:525 within `rem`
; ││││┌ @ number.jl:7 within `convert`
; │││││┌ @ boot.jl:790 within `UInt128`
; ││││││┌ @ boot.jl:775 within `toUInt128`
      movq    8(%rcx), %rcx
; │└└└└└└
; │ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:21 within `isless` @ operators.jl:421
; │┌ @ int.jl:487 within `<`
      cmpq    8(%rdx), %rcx
      sbbq    (%rdx), %rax
      setb    %al
; │└
; │ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:21 within `isless`
      popq    %rbp
      retq
.Lfunc_end0:
      .size   julia_isless_1694, .Lfunc_end0-julia_isless_1694
      .cfi_endproc
; └
                                      # -- End function

So I generalized this for all integer types up to 64 bits. Using this benchmarking script:
using BenchmarkTools
using NamedArrays

n = 10^7;
randomTuples!(v::AbstractArray{Tuple{T1,T2}}) where {T1,T2} = map!(_->(rand(T1),rand(T2)),v,v)

packabletypes = [UInt8,Int8,UInt16,Int16,UInt32,Int32,UInt64,Int64]
N = length(packabletypes)

# benchmark base
times1 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
    v = zip(rand(T1,n),rand(T2,n)) |> collect;
    times1[i,j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1
end

# add isless definitions
import Base.isless
# widen a until both a and b fit, then shift a into the higher bits
@generated function shiftwiden(a,b)
    sizea = sizeof(a)
    ex = :a
    while sizea < sizeof(b)
        ex = :(widen($ex))
        sizea *= 2
    end
    :(widen($ex) << (8*$sizea))    
end
maybe_flipsignbit(x::T) where T<:Unsigned = x
maybe_flipsignbit(x::T) where T<:Signed = reinterpret(unsigned(T), x ⊻ typemin(T))

pack(t::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} = 
shiftwiden(t...) | maybe_flipsignbit(t[2])

isless(a::Tuple{T1, T2}, b::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
 isless(pack(a), pack(b))

# benchmark modified
times2 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
    v = zip(rand(T1,n),rand(T2,n)) |> collect;
    times2[i,j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1
end

speedup = NamedArray(round.(times1./times2,sigdigits=3))
setnames!(speedup, string.(packabletypes), 1)
setnames!(speedup, string.(packabletypes), 2)
println("\nspeedup:")
speedup
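As a quick sanity check of what the `shiftwiden` step produces for concrete type pairs, here is a self-contained restatement (same definition as in the script above) with the expected expansions noted in comments:

```julia
# Widen `a` until it is at least as wide as `b`, then widen once more
# and shift it into the bits above `b`.
@generated function shiftwiden(a, b)
    sizea = sizeof(a)
    ex = :a
    while sizea < sizeof(b)
        ex = :(widen($ex))
        sizea *= 2
    end
    :(widen($ex) << (8 * $sizea))
end

# UInt8 is widened to UInt32 to match b, then to UInt64 and shifted up 32 bits:
shiftwiden(UInt8(1), UInt32(0))   # == UInt64(1) << 32
# Equal widths only widen once, reproducing the UInt128 case from the TL;DR:
shiftwiden(UInt64(1), UInt64(0))  # == UInt128(1) << 64
```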

I get the following speedups for sorting 10^7 tuples, depending on the element types:

speedup:
8×8 Named Matrix{Float64}
 A ╲ B │  UInt8    Int8  UInt16   Int16  UInt32   Int32  UInt64   Int64
───────┼───────────────────────────────────────────────────────────────
UInt8  │   2.39     2.2    2.12    1.98     1.7    1.67     1.0   0.995
Int8   │   2.15    1.91    1.88    1.73    1.55    1.58   0.968   0.966
UInt16 │   2.32    2.24    2.35    2.21    2.21    2.02    1.67    1.65
Int16  │   2.09    1.96    2.16    2.02    1.95    1.91    1.67    1.52
UInt32 │   1.82    1.82     2.1    2.03    2.22    2.05    1.96    1.75
Int32  │   1.79    1.65    1.86    1.78     1.9    1.81     1.7     1.6
UInt64 │   1.08    1.09    1.62    1.62    1.94    1.83    1.96    1.82
Int64  │  0.969   0.983    1.56    1.57    1.77    1.67    1.72    1.63
On this system:
Julia Version 1.9.0-beta4
Commit b75ddb787f (2023-02-07 21:53 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 24 × AMD Ryzen 9 3900X 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 12 on 24 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 12

Using a sorting function with branchless comparisons, the speedup is even bigger:

speedup:
8×8 Named Matrix{Float64}
 A ╲ B │  UInt8    Int8  UInt16   Int16  UInt32   Int32  UInt64   Int64
───────┼───────────────────────────────────────────────────────────────
UInt8  │   2.78    2.37     1.9     1.8    1.73    1.64    1.47    1.47
Int8   │   2.52    2.13    1.69    1.58    1.56    1.52    1.38    1.35
UInt16 │   2.43    2.29    2.51    2.37    2.44    2.29    2.16    2.13
Int16  │   2.22    2.21    2.35     2.2    2.18    2.04    2.01    1.96
UInt32 │   2.35    2.26    2.53    2.38    2.67    2.46    2.42    2.26
Int32  │   2.22    2.13    2.27    2.12    2.38     2.2    2.18     2.2
UInt64 │   1.76    1.79    2.19    2.15    2.32    2.17    2.26    2.28
Int64  │   1.72    1.72    2.09     2.0    2.15    2.07     2.2    2.15

This optimization does not seem to be useful for longer tuples, because the values are reversed in memory.
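The signed case rests on a standard trick used by `maybe_flipsignbit` above: XOR-ing a signed value with `typemin` flips the sign bit, which maps signed order onto unsigned order, so the packed unsigned comparison still sorts signed components correctly. A minimal sketch:

```julia
# Flipping the sign bit maps signed order onto unsigned order:
# Int8 range -128..127 becomes UInt8 range 0x00..0xff in the same order.
flipsignbit(x::T) where {T<:Signed} = reinterpret(unsigned(T), x ⊻ typemin(T))

vals = Int8[-128, -1, 0, 1, 127]   # ascending in signed order
flipped = flipsignbit.(vals)        # 0x00, 0x7f, 0x80, 0x81, 0xff
issorted(flipped)                   # still ascending as unsigned
```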

Here is a script to test the correctness.
using Test

packabletypes = [UInt8,Int8,UInt16,Int16,UInt32,Int32,UInt64,Int64]

@generated function shiftwiden(a,b)
    sizea = sizeof(a)
    ex = :a
    while sizea < sizeof(b)
        ex = :(widen($ex))
        sizea *= 2
    end
    :(widen($ex) << (8*$sizea))    
end
maybe_flipsignbit(x::T) where T<:Unsigned = x
maybe_flipsignbit(x::T) where T<:Signed = reinterpret(unsigned(T), x ⊻ typemin(T))

pack(t::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} = 
shiftwiden(t...) | maybe_flipsignbit(t[2])

newisless(a::Tuple{T1, T2}, b::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
 isless(pack(a), pack(b))

getTestvals(T::Type{TT}) where {TT<:Unsigned} = [zero(T), one(T), typemax(T) - one(T), typemax(T)]
getTestvals(T::Type{TT}) where {TT<:Signed} = [typemin(T), typemin(T)+1, -one(T), zero(T), one(T), typemax(T)-1, typemax(T)]

for T1 in packabletypes, T2 in packabletypes
    vals1 = getTestvals(T1)
    vals2 = getTestvals(T2)
    for val_1_1 in vals1, val_2_1 in vals1
        for val_1_2 in vals2, val_2_2 in vals2
            @test isless((val_1_1,val_1_2),(val_2_1,val_2_2)) == newisless((val_1_1,val_1_2),(val_2_1,val_2_2))
        end
    end
end

Is it worth including this optimization in Base Julia?

Edit:

Code amenable to SIMD vectorization benefits even more:
using BenchmarkTools
using NamedArrays

versioninfo()
n = 10^3;
randomTuples!(v::AbstractArray{Tuple{T1,T2}}) where {T1,T2} = map!(_->(rand(T1),rand(T2)),v,v)

packabletypes = [UInt8,Int8,UInt16,Int16,UInt32,Int32,UInt64,Int64]
N = length(packabletypes)

function benchmarkfun!(v,f)
    b = 0
    for i in eachindex(v)
        for j in eachindex(v)
            b += f(v[i],v[j])
        end
    end
    b
end

# benchmark base
times1 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
    v = zip(rand(T1,n),rand(T2,n)) |> collect;
    b = zeros(Int,n)
    times1[i,j] = @belapsed benchmarkfun!($v,isless) setup=(randomTuples!($v)) evals=1
end

# add isless definitions
# widen a until both a and b fit, then shift a into the higher bits
@generated function shiftwiden(a,b)
    sizea = sizeof(a)
    ex = :a
    while sizea < sizeof(b)
        ex = :(widen($ex))
        sizea *= 2
    end
    :(widen($ex) << (8*$sizea))    
end
maybe_flipsignbit(x::T) where T<:Unsigned = x
maybe_flipsignbit(x::T) where T<:Signed = reinterpret(unsigned(T), x ⊻ typemin(T))

pack(t::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} = 
shiftwiden(t...) | maybe_flipsignbit(t[2])

newisless(a::Tuple{T1, T2}, b::Tuple{T1, T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
 isless(pack(a), pack(b))

# benchmark modified
times2 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
    v = zip(rand(T1,n),rand(T2,n)) |> collect;
    b = zeros(Int,n)
    times2[i,j] = @belapsed benchmarkfun!($v,newisless) setup=(randomTuples!($v)) evals=1
end

speedup = NamedArray(round.(times1./times2,sigdigits=3))
setnames!(speedup, string.(packabletypes), 1)
setnames!(speedup, string.(packabletypes), 2)
println("\nspeedup:")
speedup
speedup:
8×8 Named Matrix{Float64}
 A ╲ B │  UInt8    Int8  UInt16   Int16  UInt32   Int32  UInt64   Int64
───────┼───────────────────────────────────────────────────────────────
UInt8  │   7.98    5.76    7.65    6.05    3.92     3.8    3.43    2.47
Int8   │    9.7    6.72    6.19    5.12     3.9    3.81    2.47    1.96
UInt16 │   5.86     5.0     8.0    5.72    4.23    4.12    3.41    2.63
Int16  │   6.53     5.7    9.36    6.59    4.18     4.2    2.51    2.08
UInt32 │    4.0    4.01    4.24    4.05    4.35    4.55    2.59     1.8
Int32  │   4.15     4.1    4.38     4.2    5.14    4.73    2.04    1.61
UInt64 │   3.41    2.51    3.49    2.79    2.42    2.33    2.61    2.63
Int64  │   2.48    2.35    2.57    2.46    1.83    1.83    2.01    2.06
Extra benchmark of the regular sort! on different hardware
Julia Version 1.9.0-beta3
Commit 24204a7344 (2023-01-18 07:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i7-7820X CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 1 on 16 virtual cores

speedup:
8×8 Named Matrix{Float64}
 A ╲ B │  UInt8    Int8  UInt16   Int16  UInt32   Int32  UInt64   Int64
───────┼───────────────────────────────────────────────────────────────
UInt8  │   2.43    1.97    2.11    1.64    1.65    1.62    1.14    1.07
Int8   │   1.97     2.0    1.88    1.89    1.61    1.59     1.1    1.12
UInt16 │   2.25    2.07    2.35    2.12    2.04    2.14    1.77    1.68
Int16  │   1.81    1.92    2.29    2.06    1.89    1.95    1.76    1.69
UInt32 │   1.72    1.62    2.07    1.84    2.17    1.87    1.95    1.77
Int32  │   1.77    1.75    2.02    1.91    1.94    1.92    1.82    1.65
UInt64 │   1.17    1.16    1.55    1.61    1.96    1.73    1.91    1.79
Int64  │   1.15    1.12    1.62    1.55    1.63    1.56     1.9    1.74