
Fastpath for comparing tuples of two BitIntegers #48753

Merged: 10 commits, Feb 27, 2023
Conversation

@LilithHafner (Member)

Implements #48724

I have yet to reproduce benchmarks and thoroughly think through correctness, but this seems promising.

4 lines of code for a 2x speedup is worth it, even if sorting tuples is a fairly uncommon use-case. Hopefully the compiler folks can make this obsolete soon.
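
(To make this concrete, here is a minimal sketch of the packing idea for unsigned 8-bit elements; the names pack and fast_isless are illustrative, not the PR's diff. Both elements are fused into one wider integer so a single comparison decides the lexicographic order.)

# Illustrative sketch only, not the PR's code:
pack(t::NTuple{2,UInt8}) = UInt16(t[1]) << 8 | UInt16(t[2])
fast_isless(a::NTuple{2,UInt8}, b::NTuple{2,UInt8}) = isless(pack(a), pack(b))

fast_isless((0x01, 0xff), (0x02, 0x00))  # true, matching isless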

Closes #48724 cc @LSchwerdt.

@LilithHafner LilithHafner added performance Must go faster sorting Put things in order labels Feb 22, 2023
@LilithHafner (Member, Author)

I don't expect this to report anything notable now, but we could add a benchmark for this case to nanosoldier.

@nanosoldier runbenchmarks("sort", vs=":master")

@LilithHafner LilithHafner added the potential benchmark Could make a good benchmark in BaseBenchmarks label Feb 22, 2023
@nanosoldier (Collaborator)

Your job failed.

base/tuple.jl Outdated
Comment on lines 570 to 571
# TODO: remove this when the compiler can optimize the generic version better
# See #48724 and #48753

@vtjnash (Member)

Suggested change: delete the two TODO lines above.

@LilithHafner (Member, Author)

I added this TODO to make it clear to future readers that it is okay to remove this fastpath once it no longer produces measurable performance improvements. @vtjnash, why do you prefer to remove it?

@LSchwerdt (Contributor)

Thank you for implementing this so quickly.

correctness

widen(a1) << 8sizeof($T) + a2 can underflow. I used widen(a1) << 8sizeof($T) | maybe_flipsignbit(a2) instead, which should be correct.
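
(To make the underflow concrete, here is what the original formula does for T = Int8 at the extremes; orig_pack, fixed_pack, and the body of maybe_flipsignbit are illustrative reconstructions, assuming the approach described above.)

# The original formula, evaluated in Int16 arithmetic for T = Int8:
orig_pack(t) = widen(t[1]) << 8 + t[2]
lo = (typemin(Int8), typemin(Int8))    # (-128, -128), the smallest tuple
hi = (typemax(Int8), typemax(Int8))    # (127, 127), the largest tuple
orig_pack(lo)                          # -32768 + (-128) wraps to Int16(32640)
orig_pack(hi)                          # 127 << 8 + 127 == Int16(32639)
isless(orig_pack(lo), orig_pack(hi))   # false, although isless(lo, hi) is true

# Flipping the sign bit turns signed order into unsigned order, so | is safe:
maybe_flipsignbit(x::Unsigned) = x
maybe_flipsignbit(x::Signed) = reinterpret(unsigned(typeof(x)), xor(x, typemin(x)))
fixed_pack(t) = widen(maybe_flipsignbit(t[1])) << 8 | maybe_flipsignbit(t[2])
isless(fixed_pack(lo), fixed_pack(hi)) # true, as expected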

Test script:
using Test
using Base: BitIntegerSmall_types


for T in unique((BitIntegerSmall_types..., Int, UInt))
    @eval PR_isless((a1,a2)::NTuple{2, $T}, (b1,b2)::NTuple{2, $T}) =
        isless(widen(a1) << 8sizeof($T) + a2, widen(b1) << 8sizeof($T) + b2)
end


getTestvals(T::Type{TT}) where {TT<:Unsigned} = [zero(T), one(T), 2one(T), typemax(T) - 3one(T), typemax(T) - one(T), typemax(T)]
#getTestvals(T::Type{TT}) where {TT<:Signed} = [typemin(T), typemin(T)+1, typemin(T)+2, -3one(T), -2one(T), -one(T), zero(T), one(T), 2one(T), 3one(T), typemax(T)-3, typemax(T)-2, typemax(T)-1, typemax(T)]
getTestvals(T::Type{TT}) where {TT<:Signed} = [typemin(T), zero(T)]

@testset begin
for T in unique((BitIntegerSmall_types..., Int, UInt))
    vals = getTestvals(T)
    fails = Any[]
    for val_1_1 in vals, val_2_1 in vals
        for val_1_2 in vals, val_2_2 in vals
            isless((val_1_1,val_1_2),(val_2_1,val_2_2)) == PR_isless((val_1_1,val_1_2),(val_2_1,val_2_2)) || push!(fails,((val_1_1, val_1_2),(val_2_1, val_2_2)))
            #@test isless((val_1_1,val_1_2),(val_2_1,val_2_2)) == PR_isless((val_1_1,val_1_2),(val_2_1,val_2_2))
        end
    end
    println("$T, failures: $(length(fails))")
    if !isempty(fails)
        for x in fails
            println("$x fails")
        end
    end
end
end

implementation

I think simplifying the code by omitting the @generated function for handling different sizes of the two tuple items is a reasonable choice. But even without it, the code works for all two-element tuples whose elements have the same size*.
Is there a clean way of dispatching on this? The naive approach is to define methods for all combinations of Int and UInt (see the sketch below), but I don't like that. If not, limiting this to NTuples is fine, imho.
*(actually it works for all tuples t where sizeof(t[1]) >= sizeof(t[2]))
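
(A sketch of that naive approach for the 8-bit size class, using a sign-bit-flipping pack so the arithmetic stays unsigned; flip, pack2, and my_isless are illustrative names, not proposed PR code.)

# One method per same-size signed/unsigned pairing: 4 methods per size class.
flip(x::Unsigned) = x
flip(x::Signed) = reinterpret(unsigned(typeof(x)), xor(x, typemin(x)))
pack2((a, b)) = widen(flip(a)) << 8sizeof(b) | flip(b)
for (T1, T2) in ((Int8, Int8), (Int8, UInt8), (UInt8, Int8), (UInt8, UInt8))
    @eval my_isless(a::Tuple{$T1,$T2}, b::Tuple{$T1,$T2}) =
        isless(pack2(a), pack2(b))
end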

@LilithHafner (Member, Author)

I fixed that correctness issue, thanks!

I also added support for heterogeneous tuples, but idk if it's worth the added complexity. I'm very open to reverting that commit.

base/tuple.jl Outdated
for T in unique((BitIntegerSmall_types..., Int, UInt))
@eval isless(a::NTuple{2, $T}, b::NTuple{2, $T}) = isless(_pack_tuple(a), _pack_tuple(b))
_pack_tuple_promote(a::Unsigned,b) = unsigned(promote(a,b)[1])
_pack_tuple_promote(a::Signed,b) = signed(promote(a,b)[1])

@LSchwerdt (Contributor)

I could not get the adaptive widening to work using promotion; that's why I used the @generated function.
Promotion can result in calling promote(UInt8(200), Int8(-1)), which errors with InexactError: check_top_bit(UInt8, -1).
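
(For reference, the failure mode in a REPL; same-size mixed signedness promotes to the unsigned type, which cannot represent -1:)

julia> promote(UInt8(200), Int8(-1))
ERROR: InexactError: check_top_bit(UInt8, -1)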

@LSchwerdt (Contributor)

I also added support for heterogeneous tuples, but idk if it's worth the added complexity. I'm very open to reverting that commit.

I think only supporting tuples with elements of the same size is a reasonable compromise. But since this optimization is probably used quite rarely, only supporting NTuples is fine.

Commit cc07250 passes my tests:
using Test
using Base: BitIntegerSmall_types

_pack_tuple((a,b)) = widen((a)) << 8sizeof(b) - typemin(b) + b
for T in unique((BitIntegerSmall_types..., Int, UInt))
    @eval PR_isless(a::NTuple{2, $T}, b::NTuple{2, $T}) = isless(_pack_tuple(a), _pack_tuple(b))
end

getTestvals(T::Type{TT}) where {TT<:Unsigned} = [zero(T), one(T), 2one(T), typemax(T) - 3one(T), typemax(T) - one(T), typemax(T)]
getTestvals(T::Type{TT}) where {TT<:Signed} = [typemin(T), typemin(T)+1, typemin(T)+2, -3one(T), -2one(T), -one(T), zero(T), one(T), 2one(T), 3one(T), typemax(T)-3, typemax(T)-2, typemax(T)-1, typemax(T)]

@testset begin
for T in unique((BitIntegerSmall_types..., Int, UInt))
    vals = getTestvals(T)
    fails = Any[]
    for val_1_1 in vals, val_2_1 in vals
        for val_1_2 in vals, val_2_2 in vals
            isless((val_1_1,val_1_2),(val_2_1,val_2_2)) == PR_isless((val_1_1,val_1_2),(val_2_1,val_2_2)) || push!(fails,((val_1_1, val_1_2),(val_2_1, val_2_2)))
            @test isless((val_1_1,val_1_2),(val_2_1,val_2_2)) == PR_isless((val_1_1,val_1_2),(val_2_1,val_2_2))
        end
    end
    println("$T, failures: $(length(fails))")
    if !isempty(fails)
        for x in fails
            println("$x fails")
        end
    end
end
end
Test Summary: |   Pass   Total  Time
test set      | 158848  158848  0.4s
And benchmarks:
using Base: BitIntegerSmall_types
using BenchmarkTools
import Base.isless

packabletypes = unique((BitIntegerSmall_types..., Int, UInt))
N = length(packabletypes)
n = 10^5
randomTuples!(v::AbstractArray{Tuple{T1,T2}}) where {T1,T2} = map!(_->(rand(T1),rand(T2)),v,v)

# benchmark base
times1 = zeros(N)
for (i,T) in enumerate(packabletypes)
    v = zip(rand(T,n),rand(T,n)) |> collect;
    times1[i] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1
end

_pack_tuple((a,b)) = widen((a)) << 8sizeof(b) - typemin(b) + b
for T in unique((BitIntegerSmall_types..., Int, UInt))
    @eval isless(a::NTuple{2, $T}, b::NTuple{2, $T}) = isless(_pack_tuple(a), _pack_tuple(b))
end

# benchmark modified
times2 = zeros(N)
for (i,T) in enumerate(packabletypes)
    v = zip(rand(T,n),rand(T,n)) |> collect;
    times2[i] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1
end

speedups = round.(times1./times2,sigdigits=3)

println("speedups:")
for i in 1:N
    println("$(packabletypes[i]): $(speedups[i])") 
end
speedups:
Int8: 1.93  
Int16: 2.04 
Int32: 1.9  
UInt8: 2.27 
UInt16: 2.31
UInt32: 2.17
Int64: 1.57 
UInt64: 1.99

base/tuple.jl Outdated
Comment on lines 574 to 580

_pack_tuple((a,b)) = widen(_pack_tuple_promote(a,b)) << 8sizeof(b) - typemin(b) + b
let T = Union{Base.BitIntegerSmall, Int, UInt}
    function isless(a::Tuple{T, T}, b::Tuple{T, T})
        typeof(a) == typeof(b) || return invoke(isless, NTuple{2, NTuple{2}}, a, b)
        isless(_pack_tuple(a), _pack_tuple(b))
    end
end

@LSchwerdt (Contributor)

Suggested change (replace the above with):

_pack_tuple((a,b)) = widen(a) << 8sizeof(b) - typemin(b) + b
for T in unique((BitUnsignedSmall_types..., UInt))
    Ts = signed(T)
    @eval isless(a::Tuple{T1, T2}, b::Tuple{T1, T2}) where {T1<:Union{$T,$Ts}, T2<:Union{$T,$Ts}} =
        isless(_pack_tuple(a), _pack_tuple(b))
end

How about this, which only redefines isless for tuples with elements of the same size?

@LilithHafner (Member, Author)

I thought that isless((1, 1), (1, unsigned(2))) would fail there because I misunderstood dispatch. Tuple types dispatch differently than other parametric types (they are covariant in their parameters), so this works. Thanks!
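
(For reference, the covariance difference in question; this is standard Julia behavior, not PR code:)

julia> Tuple{Int8} <: Tuple{Integer}     # tuple types are covariant
true

julia> Vector{Int8} <: Vector{Integer}   # other parametric types are invariant
false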

@LilithHafner (Member, Author)

...since this optimization is probably used quite rarely, only supporting NTuples is fine.

Agreed. I'm going to try reverting support for heterogeneous tuples.

@LilithHafner (Member, Author)

I think this would work for heterogeneous tuples (and even comparing two tuples with different first types) but I don't see much performance improvement in the heterogeneous case:

function _pack_tuple((a,b))
    wa = widen(a)
    sizeof(wa) > sizeof(b) ? wa << 8sizeof(b) - typemin(b) + b : _pack_tuple((wa, b))
end
isless(a,b) = isless(a,b)
isless((ab::Vararg{Tuple{Union{BitIntegerSmall, Int, UInt}, T}, 2})) where T <: Union{BitIntegerSmall, Int, UInt} =
    isless(_pack_tuple(ab[1]), _pack_tuple(ab[2]))

@LSchwerdt (Contributor)

I think this would work for heterogeneous tuples (and even comparing two tuples with different first types) but I don't see much performance improvement in the heterogeneous case:

function _pack_tuple((a,b))
    wa = widen(a)
    sizeof(wa) > sizeof(b) ? wa << 8sizeof(b) - typemin(b) + b : _pack_tuple((wa, b))
end
isless(a,b) = isless(a,b)
isless((ab::Vararg{Tuple{Union{BitIntegerSmall, Int, UInt}, T}, 2})) where T <: Union{BitIntegerSmall, Int, UInt} =
    isless(_pack_tuple(ab[1]), _pack_tuple(ab[2]))

Nice!
Apparently I underestimated the compiler, because in my testing this gives the same performance as my @generated version: everything is sped up nicely when used inside sort, except combinations of 8- and 64-bit elements. isless for those shows about the same performance (maybe even a single-digit percentage regression), but it benefits from being used branchlessly, where it is at least 30% faster than base.

test:
using Test
using Base: BitIntegerSmall, BitIntegerSmall_types
import Base.isless

function _pack_tuple((a,b))
    wa = widen(a)
    sizeof(wa) > sizeof(b) ? wa << 8sizeof(b) - typemin(b) + b : _pack_tuple((wa, b))
end
PR_isless((ab::Vararg{Tuple{Union{BitIntegerSmall, Int, UInt}, T}, 2})) where T <: Union{BitIntegerSmall, Int, UInt} =
    isless(_pack_tuple(ab[1]), _pack_tuple(ab[2]))

getTestvals(T::Type{TT}) where {TT<:Unsigned} = [zero(T), one(T), 2one(T), typemax(T) - 3one(T), typemax(T) - one(T), typemax(T)]
getTestvals(T::Type{TT}) where {TT<:Signed} = [typemin(T), typemin(T)+1, typemin(T)+2, -3one(T), -2one(T), -one(T), zero(T), one(T), 2one(T), 3one(T), typemax(T)-3, typemax(T)-2, typemax(T)-1, typemax(T)]


@testset begin
    for T1 in unique((BitIntegerSmall_types..., Int, UInt)), T2 in unique((BitIntegerSmall_types..., Int, UInt))
        vals1 = getTestvals(T1)
        vals2 = getTestvals(T2)
        for val_1_1 in vals1, val_2_1 in vals1
            for val_1_2 in vals2, val_2_2 in vals2
                @test isless((val_1_1,val_1_2),(val_2_1,val_2_2)) == PR_isless((val_1_1,val_1_2),(val_2_1,val_2_2))
            end
        end
    end
end
Test Summary: |   Pass   Total  Time
test set      | 861184  861184  0.9s
benchmark:
using Base: BitIntegerSmall, BitIntegerSmall_types
using BenchmarkTools
import Base.isless
using NamedArrays

#packabletypes = unique((BitIntegerSmall_types..., Int, UInt))
packabletypes = [UInt8,Int8,UInt16,Int16,UInt32,Int32,UInt64,Int64] # presorted for clear output
N = length(packabletypes)
n = 10^5
randomTuples!(v::AbstractArray{Tuple{T1,T2}}) where {T1,T2} = map!(_->(rand(T1),rand(T2)),v,v)

# benchmark base
times1 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
    v = zip(rand(T1,n),rand(T2,n)) |> collect;
    times1[i,j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1 seconds=1
end


function _pack_tuple((a,b))
    wa = widen(a)
    sizeof(wa) > sizeof(b) ? wa << 8sizeof(b) - typemin(b) + b : _pack_tuple((wa, b))
end
isless((ab::Vararg{Tuple{Union{BitIntegerSmall, Int, UInt}, T}, 2})) where T <: Union{BitIntegerSmall, Int, UInt} =
    isless(_pack_tuple(ab[1]), _pack_tuple(ab[2]))

# benchmark modified
times2 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
    v = zip(rand(T1,n),rand(T2,n)) |> collect;
    times2[i,j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1 seconds=1
end

speedup = NamedArray(round.(times1./times2,sigdigits=3))
setnames!(speedup, string.(packabletypes), 1)
setnames!(speedup, string.(packabletypes), 2)
println("\nspeedup:")
speedup
speedup:
8×8 Named Matrix{Float64}
 A ╲ B │  UInt8    Int8  UInt16   Int16  UInt32   Int32  UInt64   Int64
───────┼───────────────────────────────────────────────────────────────
UInt8  │   2.17    2.13    2.17    2.07    1.79    1.74   0.983    1.02
Int8   │   2.05    1.96    1.99    1.86     1.7    1.62    1.02   0.987
UInt16 │    2.3    2.18    2.28    2.13    2.26     2.1    1.77    1.66
Int16  │   2.08    1.98     2.2    2.04    2.04     1.9    1.61     1.5
UInt32 │   1.84    1.72    2.15    1.97    2.14    2.03    1.98    1.74
Int32  │   1.74    1.64    1.88    1.81    1.88    1.82    1.76    1.53
UInt64 │   1.11     1.1    1.54    1.35    1.73     1.5    2.04    1.71
Int64  │    1.0    1.02    1.36    1.19    1.45    1.26    1.81    1.59
  • isless(a,b) = isless(a,b) ?
  • Using Vararg is clever, but I think isless(a::Tuple{T1, T2}, b::Tuple{T1, T2}) where {T1<:Union{...}, T2<:Union{...}} is easier to understand.

@vtjnash (Member) commented Feb 23, 2023

Aside: it looks like the LLVM optimizer does not know about this transform, though we can get closer to the PR behavior by hoisting the conditional evaluation of the indexing calls explicitly (it is the getindex call, not the isless, that needs hoisting, but here I hoist both):

julia> code_llvm((NTuple{2,Int},NTuple{2,Int})) do x, y
       ifelse(Base.isless(x[1],y[1]), true, (!Base.isless(y[1],x[1])) & Base.isless(x[2],y[2]))
       end
;  @ REPL[23]:2 within `#45`
define i8 @"julia_#45_311"([2 x i64]* nocapture noundef nonnull readonly align 8 dereferenceable(16) %0, [2 x i64]* nocapture noundef nonnull readonly align 8 dereferenceable(16) %1) #0 {
top:
  %2 = call {}*** inttoptr (i64 140703307447525 to {}*** (i64)*)(i64 262) #1
  %ptls_field3 = getelementptr inbounds {}**, {}*** %2, i64 2
  %3 = bitcast {}*** %ptls_field3 to i64***
  %ptls_load45 = load i64**, i64*** %3, align 8
  %4 = getelementptr inbounds i64*, i64** %ptls_load45, i64 2
  %safepoint = load i64*, i64** %4, align 8
  fence syncscope("singlethread") seq_cst
  %5 = load volatile i64, i64* %safepoint, align 8
  fence syncscope("singlethread") seq_cst
; ┌ @ tuple.jl:31 within `getindex`
   %6 = getelementptr inbounds [2 x i64], [2 x i64]* %0, i64 0, i64 0
   %7 = getelementptr inbounds [2 x i64], [2 x i64]* %1, i64 0, i64 0
; └
; ┌ @ operators.jl:421 within `isless`
; │┌ @ int.jl:83 within `<`
    %8 = load i64, i64* %6, align 8
    %9 = load i64, i64* %7, align 8
    %.not = icmp slt i64 %8, %9
    %10 = icmp sge i64 %9, %8
; └└
; ┌ @ tuple.jl:31 within `getindex`
   %11 = getelementptr inbounds [2 x i64], [2 x i64]* %0, i64 0, i64 1
   %12 = getelementptr inbounds [2 x i64], [2 x i64]* %1, i64 0, i64 1
; └
; ┌ @ operators.jl:421 within `isless`
; │┌ @ int.jl:83 within `<`
    %13 = load i64, i64* %11, align 8
    %14 = load i64, i64* %12, align 8
    %15 = icmp slt i64 %13, %14
; └└
; ┌ @ bool.jl:38 within `&`
   %16 = and i1 %10, %15
; └
; ┌ @ essentials.jl:586 within `ifelse`
   %narrow = select i1 %.not, i1 true, i1 %16
   %17 = zext i1 %narrow to i8
; └
  ret i8 %17
}

@LilithHafner (Member, Author)

except combinations of 8 & 64 bit elements.

That's the only combination I benchmarked, lol. Thanks for the more comprehensive treatment.

isless(a,b) = isless(a,b) ?

Oops. That is an artifact of my drafting process and should not be included.

Using Vararg is clever, but I think isless(a::Tuple{T1, T2}, b::Tuple{T1, T2}) where {T1<:Union{...}, T2<:Union{...}} is easier to understand.

The equivalent would actually be isless(a::Tuple{<:Union{...}, T}, b::Tuple{<:Union{...}, T}) where T<:Union{...}, which demonstrates your point that Vararg is unclear.
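
(Spelled out, assuming the _pack_tuple from the earlier comments and an illustrative PackableSmall alias, the equivalent signature reads:)

const PackableSmall = Union{Base.BitIntegerSmall, Int, UInt}
PR_isless(a::Tuple{<:PackableSmall, T}, b::Tuple{<:PackableSmall, T}) where T<:PackableSmall =
    isless(_pack_tuple(a), _pack_tuple(b))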

we can get closer to the PR behavior by hoisting the conditional evaluation of the indexing calls explicitly

That hoisting seems to be sufficient for performance. Running @LSchwerdt's benchmarks on my machine I find that this implementation is about as fast as the packing version while being the simplest and most general yet:

isless(a::Tuple{BitInteger, BitInteger}, b::Tuple{BitInteger, BitInteger}) =
    isless(a[1], b[1]) | (isequal(a[1], b[1]) & isless(a[2], b[2]))
benchmark:
using Base: BitInteger, BitInteger_types
using BenchmarkTools
import Base.isless
using NamedArrays

#packabletypes = unique((BitIntegerSmall_types..., Int, UInt))
packabletypes = [UInt8,Int8,UInt16,Int16,UInt32,Int32,UInt64,Int64,UInt128,Int128] # presorted for clear output
N = length(packabletypes)
n = 10^5
randomTuples!(v::AbstractArray{Tuple{T1,T2}}) where {T1,T2} = map!(_->(rand(T1),rand(T2)),v,v)

# benchmark base
times1 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
    v = zip(rand(T1,n),rand(T2,n)) |> collect;
    times1[i,j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1 seconds=1
end


isless(a::Tuple{BitInteger, BitInteger}, b::Tuple{BitInteger, BitInteger}) =
    isless(a[1], b[1]) | (isequal(a[1], b[1]) & isless(a[2], b[2]))

# benchmark modified
times2 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
    v = zip(rand(T1,n),rand(T2,n)) |> collect;
    times2[i,j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1 seconds=1
end

speedup = NamedArray(round.(times1./times2,sigdigits=3))
setnames!(speedup, string.(packabletypes), 1)
setnames!(speedup, string.(packabletypes), 2)
println("\nspeedup:")
speedup
speedup:
10×10 Named Matrix{Float64}
  A ╲ B │   UInt8     Int8   UInt16    Int16   UInt32    Int32   UInt64    Int64  UInt128   Int128
────────┼─────────────────────────────────────────────────────────────────────────────────────────
UInt8   │    2.11      2.1     2.24     2.41     1.79      1.9     1.34     1.31     1.27     1.27
Int8    │    2.12     2.41      2.4     2.43     1.77     1.67      1.3      1.3     1.34      1.3
UInt16  │    2.35     2.16     1.98     2.02      2.1     2.11     1.87     1.84     1.42     1.41
Int16   │     1.9     1.82     1.63     1.71     1.89     1.74     1.44     1.47     1.36     1.31
UInt32  │    1.46     1.34      1.9     2.19     1.71     1.66     1.73     1.68     1.51      1.5
Int32   │    1.32      1.4     1.89     1.76     1.69     1.55     1.37     1.78     1.58     1.52
UInt64  │    1.02     1.03     1.41     1.52     1.58     1.79     1.54      1.6     1.57     1.54
Int64   │    1.03     1.05      1.5     1.43     1.66     1.64     1.66     1.45     1.63     1.63
UInt128 │    1.08     1.12     1.36     1.42     1.45     1.44     1.58     1.55     1.47     1.41
Int128  │    1.09     1.13     1.31     1.35     1.42     1.43     1.54     1.54     1.45     1.42

@LSchwerdt (Contributor)

The equivalent would actually be isless(a::Tuple{<:Union{...}, T}, b::Tuple{<:Union{...}, T}) where T<:Union{...}, which demonstrates your point that Vararg is unclear

You are right, I totally missed that, and your solution is even more clever than I thought.

This

isless(a::Tuple{BitInteger, BitInteger}, b::Tuple{BitInteger, BitInteger}) =
    isless(a[1], b[1]) | (isequal(a[1], b[1]) & isless(a[2], b[2]))

is great because of its simplicity and generality, but on my 3900X I get slightly less performance from it than from the explicitly packed version.

speedup:
10×10 Named Matrix{Float64}
  A ╲ B │   UInt8     Int8   UInt16    Int16   UInt32    Int32   UInt64    Int64  UInt128   Int128
────────┼─────────────────────────────────────────────────────────────────────────────────────────
UInt8   │    1.62     1.64     1.57     1.63     1.48     1.46    0.923    0.918    0.977    0.969
Int8    │    1.65     1.66     1.68     1.68     1.47     1.48    0.925    0.956    0.977    0.994
UInt16  │    1.68     1.68     1.64     1.67     1.64     1.61     1.41     1.48      1.3     1.29
Int16   │    1.77     1.82     1.71     1.67     1.69     1.69     1.48     1.43     1.27     1.26
UInt32  │    1.53     1.59      1.6     1.62     1.72     1.68     1.59     1.58     1.57     1.53
Int32   │    1.59     1.64     1.62     1.61     1.74     1.67     1.64     1.63     1.55     1.57
UInt64  │     1.0     1.03     1.45     1.44     1.62     1.62     1.56     1.62     1.52     1.52
Int64   │     1.0      1.0     1.48     1.45     1.63     1.65     1.61     1.59     1.53     1.52
UInt128 │    1.02     1.03     1.35     1.33     1.55     1.52     1.46     1.46     1.36     1.36
Int128  │    0.98      1.0     1.33     1.35     1.51     1.55     1.47     1.45     1.34     1.37

This is reflected in the assembly, where

t = UInt64.((0,0))
@code_native isless(t,t)
produces this:
# %bb.0:                                # %top
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
; │┌ @ operators.jl:421 within `isless`
; ││┌ @ int.jl:487 within `<`
        movq    (%rcx), %rax
        movq    8(%rcx), %rcx
        cmpq    (%rdx), %rax
        setb    %r8b
; │└└
; │┌ @ operators.jl:133 within `isequal`
; ││┌ @ promotion.jl:499 within `==`
        sete    %r9b
; │└└
; │┌ @ operators.jl:421 within `isless`
; ││┌ @ int.jl:487 within `<`
        cmpq    8(%rdx), %rcx
        setb    %al
; │└└
; │┌ @ bool.jl:38 within `&`
        andb    %r9b, %al
; │└
; │┌ @ bool.jl:39 within `|`
        orb     %r8b, %al
; │└
        popq    %rbp
        retq
.Lfunc_end0:
        .size   julia_PR_isless_1750, .Lfunc_end0-julia_PR_isless_1750
        .cfi_endproc
; └
                                        # -- End function
while the packed version yields this:
# %bb.0:                                # %top
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
; │┌ @ d:\Julia_dev\sorting\tuplecompare\benchPR48753_5.jl:45 within `_pack_tuple`
; ││┌ @ operators.jl:882 within `widen`
; │││┌ @ number.jl:7 within `convert`
; ││││┌ @ boot.jl:790 within `UInt128`
; │││││┌ @ boot.jl:775 within `toUInt128`
        movq    (%rcx), %rax
        movq    8(%rcx), %rcx
; │└└└└└
; │┌ @ operators.jl:421 within `isless`
; ││┌ @ int.jl:487 within `<`
        cmpq    8(%rdx), %rcx
        sbbq    (%rdx), %rax
        setb    %al
; │└└
        popq    %rbp
        retq
.Lfunc_end0:
        .size   julia_PR_isless_packed_1989, .Lfunc_end0-julia_PR_isless_packed_1989
        .cfi_endproc
; └
                                        # -- End function

The hoisted version gets its speedup from using the non-short-circuiting operators, but it does not manage to emit the sbb (subtract with borrow) instruction that would eliminate the separate equality check. So for optimal performance both versions would need to be included.
But sacrificing a little bit of performance for clear and concise code is a good choice here, in my opinion.
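
(For completeness, "both versions" could look something like this sketch: the packed method where widening stays within one machine word, and the hoisted comparison as the general fallback; my_isless and _pack_tuple are illustrative, not the PR's final code.)

using Base: BitInteger, BitIntegerSmall

# Packed fastpath for small same-type tuples:
_pack_tuple((a, b)) = widen(a) << 8sizeof(b) - typemin(b) + b
my_isless(a::NTuple{2,T}, b::NTuple{2,T}) where {T<:BitIntegerSmall} =
    isless(_pack_tuple(a), _pack_tuple(b))
# Hoisted, branch-free comparison for everything else:
my_isless(a::Tuple{BitInteger, BitInteger}, b::Tuple{BitInteger, BitInteger}) =
    isless(a[1], b[1]) | (isequal(a[1], b[1]) & isless(a[2], b[2]))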

@LilithHafner (Member, Author)

But sacrificing a little bit of performance for clear and concise code is a good choice here, in my opinion.

A rare but valuable opinion in the Julia community. I concur in this case. Are there any changes you think this needs before merging?

@LSchwerdt (Contributor)

A rare but valuable opinion in the Julia community.

Oh. To me, the combination of high performance AND beautiful, simple code is the essence of Julia. There are plenty of high-performance languages with uglier code.
And it helps that I have found a better solution for my original problem of implementing a faster sortperm, one that does not need fast tuple comparisons. ;)

Are there any changes you think this needs before merging?

Everything looks fine to me.

Optimization of the general case:

If the manual precisely matches the implementation, and

In the expression a || b, the subexpression b is only evaluated if a evaluates to false.

is correct, then the compiler is never allowed to optimize the generic version. To enable that optimization,

If evaluating b is provably side-effect free, it may be evaluated in any case.

would need to be true. That may currently be the case, but I have not found documentation saying so.
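
(For concreteness, the generic comparison under discussion boils down to this short-circuiting shape; a simplified sketch, not the exact Base source:)

# The right-hand sides of || and && may only be evaluated eagerly if the
# compiler can prove doing so is side-effect free.
generic_isless(a::Tuple{Any,Any}, b::Tuple{Any,Any}) =
    isless(a[1], b[1]) || (isequal(a[1], b[1]) && isless(a[2], b[2]))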

@LilithHafner (Member, Author)

The compiler is allowed to emit any instructions it wants, so long as those instructions fulfill the observable, documented semantics (runtime does not count as observable). An extra cmpq instruction has no observable side effects, so this is allowed. Here's an example of this general principle at work: a mul instruction, though not called for in the specification, is emitted regardless:

julia> function f(a, b)
           c = 0
           for i in 1:a
               c += b
           end
           c
       end
f (generic function with 1 method)

julia> function g(a,b)
           @elapsed f(a,b)
       end
g (generic function with 1 method)

julia> g(10^10, 10^10)
1.2e-7

julia> @code_llvm f(1, 1)
;  @ REPL[18]:1 within `f`
define i64 @julia_f_1808(i64 signext %0, i64 signext %1) #0 {
top:
;  @ REPL[18]:3 within `f`
; ┌ @ range.jl:5 within `Colon`
; │┌ @ range.jl:397 within `UnitRange`
; ││┌ @ range.jl:404 within `unitrange_last`
     %.inv = icmp sgt i64 %0, 0
     %2 = mul i64 %1, %0
; └└└
  %spec.select = select i1 %.inv, i64 %2, i64 0
;  @ REPL[18]:6 within `f`
  ret i64 %spec.select
}

@LSchwerdt (Contributor)

Thanks! I know C++ works this way and its documentation is easy to find, but I could not find the corresponding documentation for Julia.
But I guess it was silly to think Julia might not work this way; that would be a total performance nightmare.

Labels: performance (Must go faster), sorting (Put things in order), potential benchmark (Could make a good benchmark in BaseBenchmarks)

Successfully merging this pull request may close these issues: 2x faster tuple sorting (for 2-Tuples of Integers)

5 participants