Fastpath for comparing tuples of two bitintegers #48753
Conversation
I don't expect this to report anything notable now, but we could add a benchmark for this case to nanosoldier. @nanosoldier
Your job failed.
base/tuple.jl
Outdated
# TODO: remove this when the compiler can optimize the generic version better
# See #48724 and #48753
I added this TODO to make it clear to future readers that it is okay to remove this if it no longer produces measurable performance improvements. @vtjnash, why do you prefer to remove it?
Thank you for implementing this so quickly.

correctness

Test script:

using Test
using Base: BitIntegerSmall_types
for T in unique((BitIntegerSmall_types..., Int, UInt))
@eval PR_isless((a1,a2)::NTuple{2, $T}, (b1,b2)::NTuple{2, $T}) =
isless(widen(a1) << 8sizeof($T) + a2, widen(b1) << 8sizeof($T) + b2)
end
getTestvals(T::Type{TT}) where {TT<:Unsigned} = [zero(T), one(T), 2one(T), typemax(T) ⊻ 3one(T), typemax(T) ⊻ one(T), typemax(T)]
#getTestvals(T::Type{TT}) where {TT<:Signed} = [typemin(T), typemin(T)+1, typemin(T)+2, -3one(T), -2one(T), -one(T), zero(T), one(T), 2one(T), 3one(T), typemax(T)-3, typemax(T)-2, typemax(T)-1, typemax(T)]
getTestvals(T::Type{TT}) where {TT<:Signed} = [typemin(T), zero(T)]
@testset begin
for T in unique((BitIntegerSmall_types..., Int, UInt))
vals = getTestvals(T)
fails = Any[]
for val_1_1 in vals, val_2_1 in vals
for val_1_2 in vals, val_2_2 in vals
isless((val_1_1,val_1_2),(val_2_1,val_2_2)) == PR_isless((val_1_1,val_1_2),(val_2_1,val_2_2)) || push!(fails,((val_1_1, val_1_2),(val_2_1, val_2_2)))
#@test isless((val_1_1,val_1_2),(val_2_1,val_2_2)) == PR_isless((val_1_1,val_1_2),(val_2_1,val_2_2))
end
end
println("$T, failures: $(length(fails))")
if !isempty(fails)
for x in fails
println("$x fails")
end
end
end
end

implementation

I think simplifying the code by omitting the @generated function to handle different sizes of the two tuple items is a reasonable choice. But even without that, the code works for all tuples with two elements of the same size*.
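The packing idea the test script exercises can be illustrated outside Julia. A minimal Python sketch (illustrative only, not part of the PR): for unsigned 8-bit values, packing (a, b) into a single wider integer preserves lexicographic order, which is exactly what the fast path relies on.

```python
# Pack an (a, b) pair of unsigned 8-bit values into one 16-bit value:
# "widen" a, shift it past b's 8 bits, then append b.
def pack(t):
    a, b = t
    return (a << 8) | b

# Edge-heavy sample of the uint8 range; Python's tuple `<` is lexicographic,
# so it serves as the reference comparison.
vals = [0, 1, 2, 253, 254, 255]
tuples = [(a, b) for a in vals for b in vals]
assert all((pack(x) < pack(y)) == (x < y) for x in tuples for y in tuples)
```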
I fixed that correctness issue, thanks! I also added support for heterogeneous tuples, but I don't know if it's worth the added complexity. I'm very open to reverting that commit.
base/tuple.jl
Outdated
for T in unique((BitIntegerSmall_types..., Int, UInt))
    @eval isless(a::NTuple{2, $T}, b::NTuple{2, $T}) = isless(_pack_tuple(a), _pack_tuple(b))
_pack_tuple_promote(a::Unsigned,b) = unsigned(promote(a,b)[1])
_pack_tuple_promote(a::Signed,b) = signed(promote(a,b)[1])
I could not get the adaptive widening to work using promotion; that's why I used the @generated function. This can result in calling promote(UInt8(200), Int8(-1)), which errors with InexactError: check_top_bit(UInt8, -1).
I think only supporting tuples with elements of the same size is a reasonable compromise. But since this optimization is probably used very seldom, only supporting NTuples is fine.

Commit cc07250 passes my tests:

using Test
using Base: BitIntegerSmall_types
_pack_tuple((a,b)) = widen((a)) << 8sizeof(b) - typemin(b) + b
for T in unique((BitIntegerSmall_types..., Int, UInt))
@eval PR_isless(a::NTuple{2, $T}, b::NTuple{2, $T}) = isless(_pack_tuple(a), _pack_tuple(b))
end
getTestvals(T::Type{TT}) where {TT<:Unsigned} = [zero(T), one(T), 2one(T), typemax(T) ⊻ 3one(T), typemax(T) ⊻ one(T), typemax(T)]
getTestvals(T::Type{TT}) where {TT<:Signed} = [typemin(T), typemin(T)+1, typemin(T)+2, -3one(T), -2one(T), -one(T), zero(T), one(T), 2one(T), 3one(T), typemax(T)-3, typemax(T)-2, typemax(T)-1, typemax(T)]
@testset begin
for T in unique((BitIntegerSmall_types..., Int, UInt))
vals = getTestvals(T)
fails = Any[]
for val_1_1 in vals, val_2_1 in vals
for val_1_2 in vals, val_2_2 in vals
isless((val_1_1,val_1_2),(val_2_1,val_2_2)) == PR_isless((val_1_1,val_1_2),(val_2_1,val_2_2)) || push!(fails,((val_1_1, val_1_2),(val_2_1, val_2_2)))
@test isless((val_1_1,val_1_2),(val_2_1,val_2_2)) == PR_isless((val_1_1,val_1_2),(val_2_1,val_2_2))
end
end
println("$T, failures: $(length(fails))")
if !isempty(fails)
for x in fails
println("$x fails")
end
end
end
end
and benchmarks:

using Base: BitIntegerSmall_types
using BenchmarkTools
import Base.isless
packabletypes = unique((BitIntegerSmall_types..., Int, UInt))
N = length(packabletypes)
n = 10^5
randomTuples!(v::AbstractArray{Tuple{T1,T2}}) where {T1,T2} = map!(_->(rand(T1),rand(T2)),v,v)
# benchmark base
times1 = zeros(N)
for (i,T) in enumerate(packabletypes)
v = zip(rand(T,n),rand(T,n)) |> collect;
times1[i] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1
end
_pack_tuple((a,b)) = widen((a)) << 8sizeof(b) - typemin(b) + b
for T in unique((BitIntegerSmall_types..., Int, UInt))
@eval isless(a::NTuple{2, $T}, b::NTuple{2, $T}) = isless(_pack_tuple(a), _pack_tuple(b))
end
# benchmark modified
times2 = zeros(N)
for (i,T) in enumerate(packabletypes)
v = zip(rand(T,n),rand(T,n)) |> collect;
times2[i] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1
end
speedups = round.(times1./times2,sigdigits=3)
println("speedups:")
for i in 1:N
println("$(packabletypes[i]): $(speedups[i])")
end
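The "- typemin(b)" term in _pack_tuple above is what makes the packing work when the low element is signed: it biases the low element into the unsigned range before appending it. A Python sketch of the same idea for an 8-bit signed low element (illustrative only, not from the PR; BMIN stands in for typemin(Int8)):

```python
BMIN = -128  # typemin of an 8-bit signed integer

# Pack (a, b) with a signed 8-bit low element: bias b by -BMIN so the low
# 8 bits are in 0..255, then append them below the widened high element.
def pack_signed_low(t):
    a, b = t
    return (a << 8) + (b - BMIN)

# Edge-heavy sample of the int8 range; tuple `<` is the lexicographic reference.
vals = [-128, -127, -2, -1, 0, 1, 126, 127]
tuples = [(a, b) for a in vals for b in vals]
assert all((pack_signed_low(x) < pack_signed_low(y)) == (x < y)
           for x in tuples for y in tuples)
```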
base/tuple.jl
Outdated
_pack_tuple((a,b)) = widen(_pack_tuple_promote(a,b)) << 8sizeof(b) - typemin(b) + b
let T = Union{Base.BitIntegerSmall, Int, UInt}
    function isless(a::Tuple{T, T}, b::Tuple{T, T})
        typeof(a) == typeof(b) || return invoke(isless, NTuple{2, NTuple{2}}, a, b)
        isless(_pack_tuple(a), _pack_tuple(b))
    end
end
Suggested change:

_pack_tuple((a,b)) = widen((a)) << 8sizeof(b) - typemin(b) + b
for T in unique((BitUnsignedSmall_types..., UInt))
    Ts = signed(T)
    @eval isless(a::Tuple{T1, T2}, b::Tuple{T1, T2}) where {T1<:Union{$T,$Ts}, T2<:Union{$T,$Ts}} = isless(_pack_tuple(a), _pack_tuple(b))
end
How about this, which only redefines isless for tuples with elements of the same size?
I thought that isless((1, 1), (1, unsigned(2))) would fail there because I misunderstood dispatch. I think tuples dispatch differently than other parametric types, so this works. Thanks!

Agreed. I'm going to try reverting support for heterogeneous tuples.
Force-pushed ab71fd8 to 559180f, then 559180f to c71a7d8.
I think this would work for heterogeneous tuples (and even comparing two tuples with different first types), but I don't see much performance improvement in the heterogeneous case:

function _pack_tuple((a,b))
wa = widen(a)
sizeof(wa) > sizeof(b) ? wa << 8sizeof(b) - typemin(b) + b : _pack_tuple((wa, b))
end
isless(a,b) = isless(a,b)
isless((ab::Vararg{Tuple{Union{BitIntegerSmall, Int, UInt}, T}, 2})) where T <: Union{BitIntegerSmall, Int, UInt} =
    isless(_pack_tuple(ab[1]), _pack_tuple(ab[2]))
Nice!

test:

using Test
using Base: BitIntegerSmall, BitIntegerSmall_types
import Base.isless
function _pack_tuple((a,b))
wa = widen(a)
sizeof(wa) > sizeof(b) ? wa << 8sizeof(b) - typemin(b) + b : _pack_tuple((wa, b))
end
PR_isless((ab::Vararg{Tuple{Union{BitIntegerSmall, Int, UInt}, T}, 2})) where T <: Union{BitIntegerSmall, Int, UInt} =
isless(_pack_tuple(ab[1]), _pack_tuple(ab[2]))
getTestvals(T::Type{TT}) where {TT<:Unsigned} = [zero(T), one(T), 2one(T), typemax(T) ⊻ 3one(T), typemax(T) ⊻ one(T), typemax(T)]
getTestvals(T::Type{TT}) where {TT<:Signed} = [typemin(T), typemin(T)+1, typemin(T)+2, -3one(T), -2one(T), -one(T), zero(T), one(T), 2one(T), 3one(T), typemax(T)-3, typemax(T)-2, typemax(T)-1, typemax(T)]
@testset begin
for T1 in unique((BitIntegerSmall_types..., Int, UInt)), T2 in unique((BitIntegerSmall_types..., Int, UInt))
vals1 = getTestvals(T1)
vals2 = getTestvals(T2)
for val_1_1 in vals1, val_2_1 in vals1
for val_1_2 in vals2, val_2_2 in vals2
@test isless((val_1_1,val_1_2),(val_2_1,val_2_2)) == PR_isless((val_1_1,val_1_2),(val_2_1,val_2_2))
end
end
end
end
benchmark:

using Base: BitIntegerSmall, BitIntegerSmall_types
using BenchmarkTools
import Base.isless
using NamedArrays
#packabletypes = unique((BitIntegerSmall_types..., Int, UInt))
packabletypes = [UInt8,Int8,UInt16,Int16,UInt32,Int32,UInt64,Int64] # presorted for clear output
N = length(packabletypes)
n = 10^5
randomTuples!(v::AbstractArray{Tuple{T1,T2}}) where {T1,T2} = map!(_->(rand(T1),rand(T2)),v,v)
# benchmark base
times1 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
v = zip(rand(T1,n),rand(T2,n)) |> collect;
times1[i,j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1 seconds=1
end
function _pack_tuple((a,b))
wa = widen(a)
sizeof(wa) > sizeof(b) ? wa << 8sizeof(b) - typemin(b) + b : _pack_tuple((wa, b))
end
isless((ab::Vararg{Tuple{Union{BitIntegerSmall, Int, UInt}, T}, 2})) where T <: Union{BitIntegerSmall, Int, UInt} =
isless(_pack_tuple(ab[1]), _pack_tuple(ab[2]))
# benchmark modified
times2 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
v = zip(rand(T1,n),rand(T2,n)) |> collect;
times2[i,j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1 seconds=1
end
speedup = NamedArray(round.(times1./times2,sigdigits=3))
setnames!(speedup, string.(packabletypes), 1)
setnames!(speedup, string.(packabletypes), 2)
println("\nspeedup:")
speedup
Aside: it looks like the LLVM optimizer does not know about this transform, though we can get closer to the PR behavior by hoisting the conditional evaluation of the indexing calls explicitly (it is the getindex call, not the isless, that needs hoisting, but here I hoist both):

julia> code_llvm((NTuple{2,Int},NTuple{2,Int})) do x, y
ifelse(Base.isless(x[1],y[1]), true, (!Base.isless(y[1],x[1])) & Base.isless(x[2],y[2]))
end
; @ REPL[23]:2 within `#45`
define i8 @"julia_#45_311"([2 x i64]* nocapture noundef nonnull readonly align 8 dereferenceable(16) %0, [2 x i64]* nocapture noundef nonnull readonly align 8 dereferenceable(16) %1) #0 {
top:
%2 = call {}*** inttoptr (i64 140703307447525 to {}*** (i64)*)(i64 262) #1
%ptls_field3 = getelementptr inbounds {}**, {}*** %2, i64 2
%3 = bitcast {}*** %ptls_field3 to i64***
%ptls_load45 = load i64**, i64*** %3, align 8
%4 = getelementptr inbounds i64*, i64** %ptls_load45, i64 2
%safepoint = load i64*, i64** %4, align 8
fence syncscope("singlethread") seq_cst
%5 = load volatile i64, i64* %safepoint, align 8
fence syncscope("singlethread") seq_cst
; ┌ @ tuple.jl:31 within `getindex`
%6 = getelementptr inbounds [2 x i64], [2 x i64]* %0, i64 0, i64 0
%7 = getelementptr inbounds [2 x i64], [2 x i64]* %1, i64 0, i64 0
; └
; ┌ @ operators.jl:421 within `isless`
; │┌ @ int.jl:83 within `<`
%8 = load i64, i64* %6, align 8
%9 = load i64, i64* %7, align 8
%.not = icmp slt i64 %8, %9
%10 = icmp sge i64 %9, %8
; └└
; ┌ @ tuple.jl:31 within `getindex`
%11 = getelementptr inbounds [2 x i64], [2 x i64]* %0, i64 0, i64 1
%12 = getelementptr inbounds [2 x i64], [2 x i64]* %1, i64 0, i64 1
; └
; ┌ @ operators.jl:421 within `isless`
; │┌ @ int.jl:83 within `<`
%13 = load i64, i64* %11, align 8
%14 = load i64, i64* %12, align 8
%15 = icmp slt i64 %13, %14
; └└
; ┌ @ bool.jl:38 within `&`
%16 = and i1 %10, %15
; └
; ┌ @ essentials.jl:586 within `ifelse`
%narrow = select i1 %.not, i1 true, i1 %16
%17 = zext i1 %narrow to i8
; └
ret i8 %17
}
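As a sanity check on the hoisted expression in the code_llvm call above: it computes exactly lexicographic order. A small Python sketch (illustrative only, not from the PR; Python's tuple comparison is lexicographic and serves as the reference):

```python
# Mirrors ifelse(isless(x[1],y[1]), true, (!isless(y[1],x[1])) & isless(x[2],y[2])):
# if the first elements differ, they decide; otherwise the second elements do.
def hoisted_less(x, y):
    return True if x[0] < y[0] else (not y[0] < x[0]) and x[1] < y[1]

# Exhaustive check over a small signed range against tuple `<`.
vals = range(-3, 4)
pairs = [(a, b) for a in vals for b in vals]
assert all(hoisted_less(x, y) == (x < y) for x in pairs for y in pairs)
```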
That's the only combination I benchmarked, lol. Thanks for the more comprehensive treatment.
Oops. That is an artifact of my drafting process; it should not be included.
The equivalent would actually be
That hoisting seems to be sufficient for performance. Running @LSchwerdt's benchmarks on my machine I find that this implementation is about as fast as the packing version while being the simplest and most general yet:

isless(a::Tuple{BitInteger, BitInteger}, b::Tuple{BitInteger, BitInteger}) =
    isless(a[1], b[1]) | (isequal(a[1], b[1]) & isless(a[2], b[2]))

benchmark:

using Base: BitInteger, BitInteger_types
using BenchmarkTools
import Base.isless
using NamedArrays
#packabletypes = unique((BitIntegerSmall_types..., Int, UInt))
packabletypes = [UInt8,Int8,UInt16,Int16,UInt32,Int32,UInt64,Int64,UInt128,Int128] # presorted for clear output
N = length(packabletypes)
n = 10^5
randomTuples!(v::AbstractArray{Tuple{T1,T2}}) where {T1,T2} = map!(_->(rand(T1),rand(T2)),v,v)
# benchmark base
times1 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
v = zip(rand(T1,n),rand(T2,n)) |> collect;
times1[i,j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1 seconds=1
end
isless(a::Tuple{BitInteger, BitInteger}, b::Tuple{BitInteger, BitInteger}) =
isless(a[1], b[1]) | (isequal(a[1], b[1]) & isless(a[2], b[2]))
# benchmark modified
times2 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
v = zip(rand(T1,n),rand(T2,n)) |> collect;
times2[i,j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1 seconds=1
end
speedup = NamedArray(round.(times1./times2,sigdigits=3))
setnames!(speedup, string.(packabletypes), 1)
setnames!(speedup, string.(packabletypes), 2)
println("\nspeedup:")
speedup
You are right, I totally missed that, and your solution is even more clever than I thought. This

isless(a::Tuple{BitInteger, BitInteger}, b::Tuple{BitInteger, BitInteger}) =
    isless(a[1], b[1]) | (isequal(a[1], b[1]) & isless(a[2], b[2]))

is great because of its simplicity and generality, but I get slightly less performance using it than the explicitly packed version on my 3900X.

This is reflected in the assembly, where

t = UInt64.((0,0))
@code_native isless(t,t)

produces this:

# %bb.0:                                # %top
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
; │┌ @ operators.jl:421 within `isless`
; ││┌ @ int.jl:487 within `<`
movq (%rcx), %rax
movq 8(%rcx), %rcx
cmpq (%rdx), %rax
setb %r8b
; │└└
; │┌ @ operators.jl:133 within `isequal`
; ││┌ @ promotion.jl:499 within `==`
sete %r9b
; │└└
; │┌ @ operators.jl:421 within `isless`
; ││┌ @ int.jl:487 within `<`
cmpq 8(%rdx), %rcx
setb %al
; │└└
; │┌ @ bool.jl:38 within `&`
andb %r9b, %al
; │└
; │┌ @ bool.jl:39 within `|`
orb %r8b, %al
; │└
popq %rbp
retq
.Lfunc_end0:
.size julia_PR_isless_1750, .Lfunc_end0-julia_PR_isless_1750
.cfi_endproc
; └
# -- End function

while the packed version yields this:

# %bb.0:                                # %top
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
; │┌ @ d:\Julia_dev\sorting\tuplecompare\benchPR48753_5.jl:45 within `_pack_tuple`
; ││┌ @ operators.jl:882 within `widen`
; │││┌ @ number.jl:7 within `convert`
; ││││┌ @ boot.jl:790 within `UInt128`
; │││││┌ @ boot.jl:775 within `toUInt128`
movq (%rcx), %rax
movq 8(%rcx), %rcx
; │└└└└└
; │┌ @ operators.jl:421 within `isless`
; ││┌ @ int.jl:487 within `<`
cmpq 8(%rdx), %rcx
sbbq (%rdx), %rax
setb %al
; │└└
popq %rbp
retq
.Lfunc_end0:
.size julia_PR_isless_packed_1989, .Lfunc_end0-julia_PR_isless_packed_1989
.cfi_endproc
; └
# -- End function

The hoisted version gets its speedup from using the non-short-circuiting operators, but it does not manage to emit the sbb (subtract with borrow) instruction that eliminates the check for equality. So for optimal performance, both versions would need to be included.
A rare but valuable opinion in the Julia community. I concur in this case. Are there any changes you think this needs before merging?
Oh. To me, the combination of high performance AND beautiful and simple code is the essence of Julia. There are plenty of high performance languages with ugly(-er) code.
Everything looks fine to me.

Optimization of the general case:

If the manual precisely matches the implementation, and
is correct, the compiler can never optimize the generic version. To enable that,
would need to be true. It may currently be the case, but I have not found documentation saying that.
The compiler is allowed to emit any instructions it wants to, so long as those instructions fulfill the observable documented semantics (runtime does not count as observable). An extra

julia> function f(a, b)
c = 0
for i in 1:a
c += b
end
c
end
f (generic function with 1 method)
julia> function g(a,b)
@elapsed f(a,b)
end
g (generic function with 1 method)
julia> g(10^10, 10^10)
1.2e-7
julia> @code_llvm f(1, 1)
; @ REPL[18]:1 within `f`
define i64 @julia_f_1808(i64 signext %0, i64 signext %1) #0 {
top:
; @ REPL[18]:3 within `f`
; ┌ @ range.jl:5 within `Colon`
; │┌ @ range.jl:397 within `UnitRange`
; ││┌ @ range.jl:404 within `unitrange_last`
%.inv = icmp sgt i64 %0, 0
%2 = mul i64 %1, %0
; └└└
%spec.select = select i1 %.inv, i64 %2, i64 0
; @ REPL[18]:6 within `f`
ret i64 %spec.select
}
Thanks! I know C++ does it this way and the C++ documentation is easy to find, but I did not find corresponding documentation for Julia.
Implements #48724
I have yet to reproduce benchmarks and thoroughly think through correctness, but this seems promising.
4 lines of code for a 2x speedup is worth it, even if sorting tuples is a fairly uncommon use-case. Hopefully the compiler folks can make this obsolete soon.
Closes #48724 cc @LSchwerdt.