TL;DR: `isless` for 2-tuples of integers is significantly faster if implemented by packing both values into a single wider integer.
While experimenting with ways to improve the performance of `sortperm`, I noticed that sorting tuples of `UInt64` is much slower than sorting `UInt128`s. But a tuple of two `UInt64`s can be packed into a single `UInt128`:
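For illustration, a minimal version of such a packing function and the benchmark could look like this (`pack` here is a sketch, not Base code):

```julia
using BenchmarkTools

# Sketch: first element into the high 64 bits, second into the low 64 bits.
# Lexicographic order of the tuple then equals numeric order of the UInt128.
pack(t::Tuple{UInt64,UInt64}) = (UInt128(t[1]) << 64) | t[2]

v = collect(zip(rand(UInt64, 10^7), rand(UInt64, 10^7)))
@btime sort!(w)          setup=(w = copy($v)) evals=1  # default tuple isless
@btime sort!(w, by=pack) setup=(w = copy($v)) evals=1  # packed comparison
```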
Sorting 10^7 such tuples with `by=pack` is almost twice as fast on my machine compared to the default.
This can also be seen in the assembly code of the default `isless` versus a modified version that compares packed integers. Not only is the packed version shorter, it is also branchless:
Assembly for packed `isless`:

```asm
# %bb.0:                                # %top
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
; │┌ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:17 within `pack`
; ││┌ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:5 within `shiftwiden`
; │││┌ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:5 within `macro expansion`
; ││││┌ @ operators.jl:882 within `widen`
; │││││┌ @ number.jl:7 within `convert`
; ││││││┌ @ boot.jl:790 within `UInt128`
; │││││││┌ @ boot.jl:775 within `toUInt128`
	movq	(%rcx), %rax
; ││└└└└└└
; ││┌ @ int.jl:1040 within `|`
; │││┌ @ int.jl:525 within `rem`
; ││││┌ @ number.jl:7 within `convert`
; │││││┌ @ boot.jl:790 within `UInt128`
; ││││││┌ @ boot.jl:775 within `toUInt128`
	movq	8(%rcx), %rcx
; │└└└└└└
; │ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:21 within `isless` @ operators.jl:421
; │┌ @ int.jl:487 within `<`
	cmpq	8(%rdx), %rcx
	sbbq	(%rdx), %rax
	setb	%al
; │└
; │ @ d:\Julia_dev\sorting\tuplecompare\widen3.jl:21 within `isless`
	popq	%rbp
	retq
.Lfunc_end0:
	.size	julia_isless_1694, .Lfunc_end0-julia_isless_1694
	.cfi_endproc
; └                                        # -- End function
```
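For reference, listings like the one above can be reproduced with `@code_native` (the exact output depends on Julia version and CPU; `pack` is the sketch from above):

```julia
using InteractiveUtils

a = (rand(UInt64), rand(UInt64))
b = (rand(UInt64), rand(UInt64))
@code_native isless(a, b)              # default tuple comparison
@code_native isless(pack(a), pack(b))  # single UInt128 comparison
```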
So I generalized this for all integer types up to 64 bits. Using the following benchmarking script:
```julia
using BenchmarkTools
using NamedArrays

n = 10^7;
randomTuples!(v::AbstractArray{Tuple{T1,T2}}) where {T1,T2} = map!(_ -> (rand(T1), rand(T2)), v, v)
packabletypes = [UInt8, Int8, UInt16, Int16, UInt32, Int32, UInt64, Int64]
N = length(packabletypes)

# benchmark base
times1 = zeros(N, N)
for (i, T1) in enumerate(packabletypes), (j, T2) in enumerate(packabletypes)
    v = zip(rand(T1, n), rand(T2, n)) |> collect;
    times1[i, j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1
end

# add isless definitions
import Base.isless

# widen a until both a and b fit, then shift a into the higher bits
@generated function shiftwiden(a, b)
    sizea = sizeof(a)
    ex = :a
    while sizea < sizeof(b)
        ex = :(widen($ex))
        sizea *= 2
    end
    :(widen($ex) << (8 * $sizea))
end

maybe_flipsignbit(x::T) where T<:Unsigned = x
maybe_flipsignbit(x::T) where T<:Signed = reinterpret(unsigned(T), x ⊻ typemin(T))

pack(t::Tuple{T1,T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
    shiftwiden(t...) | maybe_flipsignbit(t[2])

isless(a::Tuple{T1,T2}, b::Tuple{T1,T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
    isless(pack(a), pack(b))

# benchmark modified
times2 = zeros(N, N)
for (i, T1) in enumerate(packabletypes), (j, T2) in enumerate(packabletypes)
    v = zip(rand(T1, n), rand(T2, n)) |> collect;
    times2[i, j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1
end

speedup = NamedArray(round.(times1 ./ times2, sigdigits=3))
setnames!(speedup, string.(packabletypes), 1)
setnames!(speedup, string.(packabletypes), 2)
println("\nspeedup:")
speedup
```
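To make the mechanics concrete, here is what these helpers produce for a small case (illustrative values, assuming the definitions above):

```julia
# For (UInt8, UInt32): the loop widens twice (1 -> 2 -> 4 bytes), the returned
# expression widens once more and shifts by 8*4 = 32 bits, i.e. the generated
# body is  widen(widen(widen(a))) << 32  :: UInt64.
shiftwiden(0x01, 0x00000002)   # 0x0000000100000000
pack((0x01, 0x00000002))       # 0x0000000100000002

# Flipping the sign bit maps signed order onto unsigned order of the low bits:
maybe_flipsignbit(Int8(-128))  # 0x00 (smallest Int8 maps to smallest UInt8)
maybe_flipsignbit(Int8(127))   # 0xff (largest Int8 maps to largest UInt8)
```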
I get the following speedups for sorting 10^7 tuples on this system, depending on the element types:

Using a sorting function with branchless comparisons, the speedup is even bigger:

This optimization does not seem useful for longer tuples, because the values are reversed in memory.
Here is a script to test correctness:
```julia
using Test

packabletypes = [UInt8, Int8, UInt16, Int16, UInt32, Int32, UInt64, Int64]

@generated function shiftwiden(a, b)
    sizea = sizeof(a)
    ex = :a
    while sizea < sizeof(b)
        ex = :(widen($ex))
        sizea *= 2
    end
    :(widen($ex) << (8 * $sizea))
end

maybe_flipsignbit(x::T) where T<:Unsigned = x
maybe_flipsignbit(x::T) where T<:Signed = reinterpret(unsigned(T), x ⊻ typemin(T))

pack(t::Tuple{T1,T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
    shiftwiden(t...) | maybe_flipsignbit(t[2])

newisless(a::Tuple{T1,T2}, b::Tuple{T1,T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
    isless(pack(a), pack(b))

getTestvals(T::Type{TT}) where {TT<:Unsigned} = [zero(T), one(T), typemax(T) ⊻ one(T), typemax(T)]
getTestvals(T::Type{TT}) where {TT<:Signed} = [typemin(T), typemin(T) + 1, -one(T), zero(T), one(T), typemax(T) - 1, typemax(T)]

for T1 in packabletypes, T2 in packabletypes
    vals1 = getTestvals(T1)
    vals2 = getTestvals(T2)
    for val_1_1 in vals1, val_2_1 in vals1
        for val_1_2 in vals2, val_2_2 in vals2
            @test isless((val_1_1, val_1_2), (val_2_1, val_2_2)) ==
                  newisless((val_1_1, val_1_2), (val_2_1, val_2_2))
        end
    end
end
```
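A randomized spot-check could complement the boundary values (an illustrative addition, not part of the script above):

```julia
# Compare the default and packed comparisons on random inputs as well.
for T1 in packabletypes, T2 in packabletypes, _ in 1:1000
    a = (rand(T1), rand(T2))
    b = (rand(T1), rand(T2))
    @test isless(a, b) == newisless(a, b)
end
```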
Is it worth it to include this optimization in base Julia?
Edit:
Code amenable to SIMD vectorization benefits even more:
```julia
using BenchmarkTools
using NamedArrays

versioninfo()

n = 10^3;
randomTuples!(v::AbstractArray{Tuple{T1,T2}}) where {T1,T2} = map!(_ -> (rand(T1), rand(T2)), v, v)
packabletypes = [UInt8, Int8, UInt16, Int16, UInt32, Int32, UInt64, Int64]
N = length(packabletypes)

function benchmarkfun!(v, f)
    b = 0
    for i in eachindex(v)
        for j in eachindex(v)
            b += f(v[i], v[j])
        end
    end
    b
end

# benchmark base
times1 = zeros(N, N)
for (i, T1) in enumerate(packabletypes), (j, T2) in enumerate(packabletypes)
    v = zip(rand(T1, n), rand(T2, n)) |> collect;
    b = zeros(Int, n)
    times1[i, j] = @belapsed benchmarkfun!($v, isless) setup=(randomTuples!($v)) evals=1
end

# add isless definitions
# widen a until both a and b fit, then shift a into the higher bits
@generated function shiftwiden(a, b)
    sizea = sizeof(a)
    ex = :a
    while sizea < sizeof(b)
        ex = :(widen($ex))
        sizea *= 2
    end
    :(widen($ex) << (8 * $sizea))
end

maybe_flipsignbit(x::T) where T<:Unsigned = x
maybe_flipsignbit(x::T) where T<:Signed = reinterpret(unsigned(T), x ⊻ typemin(T))

pack(t::Tuple{T1,T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
    shiftwiden(t...) | maybe_flipsignbit(t[2])

newisless(a::Tuple{T1,T2}, b::Tuple{T1,T2}) where {T1<:Union{packabletypes...}, T2<:Union{packabletypes...}} =
    isless(pack(a), pack(b))

# benchmark modified
times2 = zeros(N, N)
for (i, T1) in enumerate(packabletypes), (j, T2) in enumerate(packabletypes)
    v = zip(rand(T1, n), rand(T2, n)) |> collect;
    b = zeros(Int, n)
    times2[i, j] = @belapsed benchmarkfun!($v, newisless) setup=(randomTuples!($v)) evals=1
end

speedup = NamedArray(round.(times1 ./ times2, sigdigits=3))
setnames!(speedup, string.(packabletypes), 1)
setnames!(speedup, string.(packabletypes), 2)
println("\nspeedup:")
speedup
```
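To check where the extra SIMD benefit comes from, one can compare the code generated for both comparators (output varies with Julia version and CPU):

```julia
using InteractiveUtils

v = collect(zip(rand(UInt32, 8), rand(UInt32, 8)))
@code_llvm benchmarkfun!(v, isless)     # branchy scalar inner loop
@code_llvm benchmarkfun!(v, newisless)  # typically vectorizes the inner loop
```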
Extra benchmark of the regular `sort!` on different hardware