
Fastpath for comparing tuples of two BitIntegers #48753

Merged: 10 commits, Feb 27, 2023
Conversation

@LilithHafner (Member)

Implements #48724

I have yet to reproduce benchmarks and thoroughly think through correctness, but this seems promising.

4 lines of code for a 2x speedup is worth it, even if sorting tuples is a fairly uncommon use-case. Hopefully the compiler folks can make this obsolete soon.
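
(To make this concrete, here is a minimal sketch of the packing idea for unsigned 8-bit elements; the names pack and fast_isless are illustrative, not the PR's diff. Both elements are fused into one wider integer so a single comparison decides the lexicographic order.)

# Illustrative sketch only, not the PR's code:
pack(t::NTuple{2,UInt8}) = UInt16(t[1]) << 8 | UInt16(t[2])
fast_isless(a::NTuple{2,UInt8}, b::NTuple{2,UInt8}) = isless(pack(a), pack(b))

fast_isless((0x01, 0xff), (0x02, 0x00))  # true, matching isless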

Closes #48724 cc @LSchwerdt.

@LilithHafner LilithHafner added performance Must go faster sorting Put things in order labels Feb 22, 2023
@LilithHafner (Member, Author)

I don't expect this to report anything notable now, but we could add a benchmark for this case to nanosoldier.

@nanosoldier runbenchmarks("sort", vs=":master")

@LilithHafner LilithHafner added the potential benchmark Could make a good benchmark in BaseBenchmarks label Feb 22, 2023
@nanosoldier (Collaborator)

Your job failed.

base/tuple.jl Outdated
Comment on lines 570 to 571
# TODO: remove this when the compiler can optimize the generic version better
# See #48724 and #48753

@vtjnash (Member)

Suggested change: delete the two TODO lines above.

@LilithHafner (Member, Author)

I added this TODO to make it clear to future readers that it is okay to remove this fastpath once it no longer produces measurable performance improvements. @vtjnash, why do you prefer to remove it?

@LSchwerdt (Contributor)

Thank you for implementing this so quickly.

correctness

widen(a1) << 8sizeof($T) + a2 can underflow. I used widen(a1) << 8sizeof($T) | maybe_flipsignbit(a2) instead, which should be correct.
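
(To make the underflow concrete, here is what the original formula does for T = Int8 at the extremes; orig_pack, fixed_pack, and the body of maybe_flipsignbit are illustrative reconstructions, assuming the approach described above.)

# The original formula, evaluated in Int16 arithmetic for T = Int8:
orig_pack(t) = widen(t[1]) << 8 + t[2]
lo = (typemin(Int8), typemin(Int8))    # (-128, -128), the smallest tuple
hi = (typemax(Int8), typemax(Int8))    # (127, 127), the largest tuple
orig_pack(lo)                          # -32768 + (-128) wraps to Int16(32640)
orig_pack(hi)                          # 127 << 8 + 127 == Int16(32639)
isless(orig_pack(lo), orig_pack(hi))   # false, although isless(lo, hi) is true

# Flipping the sign bit turns signed order into unsigned order, so | is safe:
maybe_flipsignbit(x::Unsigned) = x
maybe_flipsignbit(x::Signed) = reinterpret(unsigned(typeof(x)), xor(x, typemin(x)))
fixed_pack(t) = widen(maybe_flipsignbit(t[1])) << 8 | maybe_flipsignbit(t[2])
isless(fixed_pack(lo), fixed_pack(hi)) # true, as expected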

Test script:
using Test
using Base: BitIntegerSmall_types


for T in unique((BitIntegerSmall_types..., Int, UInt))
    @eval PR_isless((a1,a2)::NTuple{2, $T}, (b1,b2)::NTuple{2, $T}) =
        isless(widen(a1) << 8sizeof($T) + a2, widen(b1) << 8sizeof($T) + b2)
end


getTestvals(T::Type{TT}) where {TT<:Unsigned} = [zero(T), one(T), 2one(T), typemax(T) - 3one(T), typemax(T) - one(T), typemax(T)]
#getTestvals(T::Type{TT}) where {TT<:Signed} = [typemin(T), typemin(T)+1, typemin(T)+2, -3one(T), -2one(T), -one(T), zero(T), one(T), 2one(T), 3one(T), typemax(T)-3, typemax(T)-2, typemax(T)-1, typemax(T)]
getTestvals(T::Type{TT}) where {TT<:Signed} = [typemin(T), zero(T)]

@testset begin
for T in unique((BitIntegerSmall_types..., Int, UInt))
    vals = getTestvals(T)
    fails = Any[]
    for val_1_1 in vals, val_2_1 in vals
        for val_1_2 in vals, val_2_2 in vals
            isless((val_1_1,val_1_2),(val_2_1,val_2_2)) == PR_isless((val_1_1,val_1_2),(val_2_1,val_2_2)) || push!(fails,((val_1_1, val_1_2),(val_2_1, val_2_2)))
            #@test isless((val_1_1,val_1_2),(val_2_1,val_2_2)) == PR_isless((val_1_1,val_1_2),(val_2_1,val_2_2))
        end
    end
    println("$T, failures: $(length(fails))")
    if !isempty(fails)
        for x in fails
            println("$x fails")
        end
    end
end
end

implementation

I think simplifying the code by omitting the @generated function for handling different sizes of the two tuple items is a reasonable choice. But even without it, the code works for all two-element tuples whose elements have the same size*.
Is there a clean way of dispatching on this? The naive approach is to define methods for all combinations of Int and UInt (see the sketch below), but I don't like that. If not, limiting this to NTuples is fine, imho.
*(actually it works for all tuples t where sizeof(t[1]) >= sizeof(t[2]))
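
(A sketch of that naive approach for the 8-bit size class, using a sign-bit-flipping pack so the arithmetic stays unsigned; flip, pack2, and my_isless are illustrative names, not proposed PR code.)

# One method per same-size signed/unsigned pairing: 4 methods per size class.
flip(x::Unsigned) = x
flip(x::Signed) = reinterpret(unsigned(typeof(x)), xor(x, typemin(x)))
pack2((a, b)) = widen(flip(a)) << 8sizeof(b) | flip(b)
for (T1, T2) in ((Int8, Int8), (Int8, UInt8), (UInt8, Int8), (UInt8, UInt8))
    @eval my_isless(a::Tuple{$T1,$T2}, b::Tuple{$T1,$T2}) =
        isless(pack2(a), pack2(b))
end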

@LilithHafner (Member, Author)

I fixed that correctness issue, thanks!

I also added support for heterogeneous tuples, but idk if it's worth the added complexity. I'm very open to reverting that commit.

base/tuple.jl Outdated
for T in unique((BitIntegerSmall_types..., Int, UInt))
@eval isless(a::NTuple{2, $T}, b::NTuple{2, $T}) = isless(_pack_tuple(a), _pack_tuple(b))
_pack_tuple_promote(a::Unsigned,b) = unsigned(promote(a,b)[1])
_pack_tuple_promote(a::Signed,b) = signed(promote(a,b)[1])

@LSchwerdt (Contributor)

I could not get the adaptive widening to work using promotion; that's why I used the @generated function.
Promotion can result in calling promote(UInt8(200), Int8(-1)), which errors with InexactError: check_top_bit(UInt8, -1).
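
(For reference, the failure mode in a REPL; same-size mixed signedness promotes to the unsigned type, which cannot represent -1:)

julia> promote(UInt8(200), Int8(-1))
ERROR: InexactError: check_top_bit(UInt8, -1)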

@LSchwerdt (Contributor)

I also added support for heterogeneous tuples, but idk if it's worth the added complexity. I'm very open to reverting that commit.

I think only supporting tuples with elements of the same size is a reasonable compromise. But since this optimization is probably used quite rarely, only supporting NTuples is fine.

Commit cc07250 passes my tests:
using Test
using Base: BitIntegerSmall_types

_pack_tuple((a,b)) = widen((a)) << 8sizeof(b) - typemin(b) + b
for T in unique((BitIntegerSmall_types..., Int, UInt))
    @eval PR_isless(a::NTuple{2, $T}, b::NTuple{2, $T}) = isless(_pack_tuple(a), _pack_tuple(b))
end

getTestvals(T::Type{TT}) where {TT<:Unsigned} = [zero(T), one(T), 2one(T), typemax(T) - 3one(T), typemax(T) - one(T), typemax(T)]
getTestvals(T::Type{TT}) where {TT<:Signed} = [typemin(T), typemin(T)+1, typemin(T)+2, -3one(T), -2one(T), -one(T), zero(T), one(T), 2one(T), 3one(T), typemax(T)-3, typemax(T)-2, typemax(T)-1, typemax(T)]

@testset begin
for T in unique((BitIntegerSmall_types..., Int, UInt))
    vals = getTestvals(T)
    fails = Any[]
    for val_1_1 in vals, val_2_1 in vals
        for val_1_2 in vals, val_2_2 in vals
            isless((val_1_1,val_1_2),(val_2_1,val_2_2)) == PR_isless((val_1_1,val_1_2),(val_2_1,val_2_2)) || push!(fails,((val_1_1, val_1_2),(val_2_1, val_2_2)))
            @test isless((val_1_1,val_1_2),(val_2_1,val_2_2)) == PR_isless((val_1_1,val_1_2),(val_2_1,val_2_2))
        end
    end
    println("$T, failures: $(length(fails))")
    if !isempty(fails)
        for x in fails
            println("$x fails")
        end
    end
end
end
Test Summary: |   Pass   Total  Time
test set      | 158848  158848  0.4s
And benchmarks:
using Base: BitIntegerSmall_types
using BenchmarkTools
import Base.isless

packabletypes = unique((BitIntegerSmall_types..., Int, UInt))
N = length(packabletypes)
n = 10^5
randomTuples!(v::AbstractArray{Tuple{T1,T2}}) where {T1,T2} = map!(_->(rand(T1),rand(T2)),v,v)

# benchmark base
times1 = zeros(N)
for (i,T) in enumerate(packabletypes)
    v = zip(rand(T,n),rand(T,n)) |> collect;
    times1[i] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1
end

_pack_tuple((a,b)) = widen((a)) << 8sizeof(b) - typemin(b) + b
for T in unique((BitIntegerSmall_types..., Int, UInt))
    @eval isless(a::NTuple{2, $T}, b::NTuple{2, $T}) = isless(_pack_tuple(a), _pack_tuple(b))
end

# benchmark modified
times2 = zeros(N)
for (i,T) in enumerate(packabletypes)
    v = zip(rand(T,n),rand(T,n)) |> collect;
    times2[i] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1
end

speedups = round.(times1./times2,sigdigits=3)

println("speedups:")
for i in 1:N
    println("$(packabletypes[i]): $(speedups[i])") 
end
speedups:
Int8: 1.93  
Int16: 2.04 
Int32: 1.9  
UInt8: 2.27 
UInt16: 2.31
UInt32: 2.17
Int64: 1.57 
UInt64: 1.99

base/tuple.jl Outdated
Comment on lines 574 to 580

_pack_tuple((a,b)) = widen(_pack_tuple_promote(a,b)) << 8sizeof(b) - typemin(b) + b
let T = Union{Base.BitIntegerSmall, Int, UInt}
    function isless(a::Tuple{T, T}, b::Tuple{T, T})
        typeof(a) == typeof(b) || return invoke(isless, NTuple{2, NTuple{2}}, a, b)
        isless(_pack_tuple(a), _pack_tuple(b))
    end
end

@LSchwerdt (Contributor)

Suggested change (replace the above with):

_pack_tuple((a,b)) = widen(a) << 8sizeof(b) - typemin(b) + b
for T in unique((BitUnsignedSmall_types..., UInt))
    Ts = signed(T)
    @eval isless(a::Tuple{T1, T2}, b::Tuple{T1, T2}) where {T1<:Union{$T,$Ts}, T2<:Union{$T,$Ts}} =
        isless(_pack_tuple(a), _pack_tuple(b))
end

How about this, which only redefines isless for tuples with elements of the same size?

@LilithHafner (Member, Author)

I thought that isless((1, 1), (1, unsigned(2))) would fail there because I misunderstood dispatch. Tuple types dispatch differently than other parametric types (they are covariant in their parameters), so this works. Thanks!
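
(For reference, the covariance difference in question; this is standard Julia behavior, not PR code:)

julia> Tuple{Int8} <: Tuple{Integer}     # tuple types are covariant
true

julia> Vector{Int8} <: Vector{Integer}   # other parametric types are invariant
false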

@LilithHafner (Member, Author)

...since this optimization is probably used quite rarely, only supporting NTuples is fine.

Agreed. I'm going to try reverting support for heterogeneous tuples.

@LilithHafner (Member, Author)

I think this would work for heterogeneous tuples (and even comparing two tuples with different first types) but I don't see much performance improvement in the heterogeneous case:

function _pack_tuple((a,b))
    wa = widen(a)
    sizeof(wa) > sizeof(b) ? wa << 8sizeof(b) - typemin(b) + b : _pack_tuple((wa, b))
end
isless(a,b) = isless(a,b)
isless((ab::Vararg{Tuple{Union{BitIntegerSmall, Int, UInt}, T}, 2})) where T <: Union{BitIntegerSmall, Int, UInt} =
    isless(_pack_tuple(ab[1]), _pack_tuple(ab[2]))

@LSchwerdt (Contributor)

I think this would work for heterogeneous tuples (and even comparing two tuples with different first types) but I don't see much performance improvement in the heterogeneous case:

function _pack_tuple((a,b))
    wa = widen(a)
    sizeof(wa) > sizeof(b) ? wa << 8sizeof(b) - typemin(b) + b : _pack_tuple((wa, b))
end
isless(a,b) = isless(a,b)
isless((ab::Vararg{Tuple{Union{BitIntegerSmall, Int, UInt}, T}, 2})) where T <: Union{BitIntegerSmall, Int, UInt} =
    isless(_pack_tuple(ab[1]), _pack_tuple(ab[2]))

Nice!
Apparently I underestimated the compiler, because in my testing this gives the same performance as my @generated version: everything is sped up nicely when used inside sort, except combinations of 8- and 64-bit elements. isless for those shows about the same performance (maybe even a single-digit percentage regression), but it benefits from being used branchlessly, where it is at least 30% faster than base.

test:
using Test
using Base: BitIntegerSmall, BitIntegerSmall_types
import Base.isless

function _pack_tuple((a,b))
    wa = widen(a)
    sizeof(wa) > sizeof(b) ? wa << 8sizeof(b) - typemin(b) + b : _pack_tuple((wa, b))
end
PR_isless((ab::Vararg{Tuple{Union{BitIntegerSmall, Int, UInt}, T}, 2})) where T <: Union{BitIntegerSmall, Int, UInt} =
    isless(_pack_tuple(ab[1]), _pack_tuple(ab[2]))

getTestvals(T::Type{TT}) where {TT<:Unsigned} = [zero(T), one(T), 2one(T), typemax(T) - 3one(T), typemax(T) - one(T), typemax(T)]
getTestvals(T::Type{TT}) where {TT<:Signed} = [typemin(T), typemin(T)+1, typemin(T)+2, -3one(T), -2one(T), -one(T), zero(T), one(T), 2one(T), 3one(T), typemax(T)-3, typemax(T)-2, typemax(T)-1, typemax(T)]


@testset begin
    for T1 in unique((BitIntegerSmall_types..., Int, UInt)), T2 in unique((BitIntegerSmall_types..., Int, UInt))
        vals1 = getTestvals(T1)
        vals2 = getTestvals(T2)
        for val_1_1 in vals1, val_2_1 in vals1
            for val_1_2 in vals2, val_2_2 in vals2
                @test isless((val_1_1,val_1_2),(val_2_1,val_2_2)) == PR_isless((val_1_1,val_1_2),(val_2_1,val_2_2))
            end
        end
    end
end
Test Summary: |   Pass   Total  Time
test set      | 861184  861184  0.9s
benchmark:
using Base: BitIntegerSmall, BitIntegerSmall_types
using BenchmarkTools
import Base.isless
using NamedArrays

#packabletypes = unique((BitIntegerSmall_types..., Int, UInt))
packabletypes = [UInt8,Int8,UInt16,Int16,UInt32,Int32,UInt64,Int64] # presorted for clear output
N = length(packabletypes)
n = 10^5
randomTuples!(v::AbstractArray{Tuple{T1,T2}}) where {T1,T2} = map!(_->(rand(T1),rand(T2)),v,v)

# benchmark base
times1 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
    v = zip(rand(T1,n),rand(T2,n)) |> collect;
    times1[i,j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1 seconds=1
end


function _pack_tuple((a,b))
    wa = widen(a)
    sizeof(wa) > sizeof(b) ? wa << 8sizeof(b) - typemin(b) + b : _pack_tuple((wa, b))
end
isless((ab::Vararg{Tuple{Union{BitIntegerSmall, Int, UInt}, T}, 2})) where T <: Union{BitIntegerSmall, Int, UInt} =
    isless(_pack_tuple(ab[1]), _pack_tuple(ab[2]))

# benchmark modified
times2 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
    v = zip(rand(T1,n),rand(T2,n)) |> collect;
    times2[i,j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1 seconds=1
end

speedup = NamedArray(round.(times1./times2,sigdigits=3))
setnames!(speedup, string.(packabletypes), 1)
setnames!(speedup, string.(packabletypes), 2)
println("\nspeedup:")
speedup
speedup:
8×8 Named Matrix{Float64}
 A ╲ B │  UInt8    Int8  UInt16   Int16  UInt32   Int32  UInt64   Int64
───────┼───────────────────────────────────────────────────────────────
UInt8  │   2.17    2.13    2.17    2.07    1.79    1.74   0.983    1.02
Int8   │   2.05    1.96    1.99    1.86     1.7    1.62    1.02   0.987
UInt16 │    2.3    2.18    2.28    2.13    2.26     2.1    1.77    1.66
Int16  │   2.08    1.98     2.2    2.04    2.04     1.9    1.61     1.5
UInt32 │   1.84    1.72    2.15    1.97    2.14    2.03    1.98    1.74
Int32  │   1.74    1.64    1.88    1.81    1.88    1.82    1.76    1.53
UInt64 │   1.11     1.1    1.54    1.35    1.73     1.5    2.04    1.71
Int64  │    1.0    1.02    1.36    1.19    1.45    1.26    1.81    1.59
  • isless(a,b) = isless(a,b) ?
  • Using Vararg is clever, but I think isless(a::Tuple{T1, T2}, b::Tuple{T1, T2}) where {T1<:Union{...}, T2<:Union{...}} is easier to understand.

@vtjnash (Member) commented Feb 23, 2023

Aside: it looks like the LLVM optimizer does not know about this transform, though we can get closer to the PR behavior by hoisting the conditional evaluation of the indexing calls explicitly (it is the getindex call, not the isless, that needs hoisting, but here I hoist both):

julia> code_llvm((NTuple{2,Int},NTuple{2,Int})) do x, y
       ifelse(Base.isless(x[1],y[1]), true, (!Base.isless(y[1],x[1])) & Base.isless(x[2],y[2]))
       end
;  @ REPL[23]:2 within `#45`
define i8 @"julia_#45_311"([2 x i64]* nocapture noundef nonnull readonly align 8 dereferenceable(16) %0, [2 x i64]* nocapture noundef nonnull readonly align 8 dereferenceable(16) %1) #0 {
top:
  %2 = call {}*** inttoptr (i64 140703307447525 to {}*** (i64)*)(i64 262) #1
  %ptls_field3 = getelementptr inbounds {}**, {}*** %2, i64 2
  %3 = bitcast {}*** %ptls_field3 to i64***
  %ptls_load45 = load i64**, i64*** %3, align 8
  %4 = getelementptr inbounds i64*, i64** %ptls_load45, i64 2
  %safepoint = load i64*, i64** %4, align 8
  fence syncscope("singlethread") seq_cst
  %5 = load volatile i64, i64* %safepoint, align 8
  fence syncscope("singlethread") seq_cst
; ┌ @ tuple.jl:31 within `getindex`
   %6 = getelementptr inbounds [2 x i64], [2 x i64]* %0, i64 0, i64 0
   %7 = getelementptr inbounds [2 x i64], [2 x i64]* %1, i64 0, i64 0
; └
; ┌ @ operators.jl:421 within `isless`
; │┌ @ int.jl:83 within `<`
    %8 = load i64, i64* %6, align 8
    %9 = load i64, i64* %7, align 8
    %.not = icmp slt i64 %8, %9
    %10 = icmp sge i64 %9, %8
; └└
; ┌ @ tuple.jl:31 within `getindex`
   %11 = getelementptr inbounds [2 x i64], [2 x i64]* %0, i64 0, i64 1
   %12 = getelementptr inbounds [2 x i64], [2 x i64]* %1, i64 0, i64 1
; └
; ┌ @ operators.jl:421 within `isless`
; │┌ @ int.jl:83 within `<`
    %13 = load i64, i64* %11, align 8
    %14 = load i64, i64* %12, align 8
    %15 = icmp slt i64 %13, %14
; └└
; ┌ @ bool.jl:38 within `&`
   %16 = and i1 %10, %15
; └
; ┌ @ essentials.jl:586 within `ifelse`
   %narrow = select i1 %.not, i1 true, i1 %16
   %17 = zext i1 %narrow to i8
; └
  ret i8 %17
}

@LilithHafner (Member, Author)

except combinations of 8 & 64 bit elements.

That's the only combination I benchmarked, lol. Thanks for the more comprehensive treatment.

isless(a,b) = isless(a,b) ?

Oops. That is an artifact of my drafting process and should not be included.

Using Vararg is clever, but I think isless(a::Tuple{T1, T2}, b::Tuple{T1, T2}) where {T1<:Union{...}, T2<:Union{...}} is easier to understand.

The equivalent would actually be isless(a::Tuple{<:Union{...}, T}, b::Tuple{<:Union{...}, T}) where T<:Union{...}, which demonstrates your point that Vararg is unclear.
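
(Spelled out, assuming the _pack_tuple from the earlier comments and an illustrative PackableSmall alias, the equivalent signature reads:)

const PackableSmall = Union{Base.BitIntegerSmall, Int, UInt}
PR_isless(a::Tuple{<:PackableSmall, T}, b::Tuple{<:PackableSmall, T}) where T<:PackableSmall =
    isless(_pack_tuple(a), _pack_tuple(b))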

we can get closer to the PR behavior by hoisting the conditional evaluation of the indexing calls explicitly

That hoisting seems to be sufficient for performance. Running @LSchwerdt's benchmarks on my machine I find that this implementation is about as fast as the packing version while being the simplest and most general yet:

isless(a::Tuple{BitInteger, BitInteger}, b::Tuple{BitInteger, BitInteger}) =
    isless(a[1], b[1]) | (isequal(a[1], b[1]) & isless(a[2], b[2]))
benchmark:
using Base: BitInteger, BitInteger_types
using BenchmarkTools
import Base.isless
using NamedArrays

#packabletypes = unique((BitIntegerSmall_types..., Int, UInt))
packabletypes = [UInt8,Int8,UInt16,Int16,UInt32,Int32,UInt64,Int64,UInt128,Int128] # presorted for clear output
N = length(packabletypes)
n = 10^5
randomTuples!(v::AbstractArray{Tuple{T1,T2}}) where {T1,T2} = map!(_->(rand(T1),rand(T2)),v,v)

# benchmark base
times1 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
    v = zip(rand(T1,n),rand(T2,n)) |> collect;
    times1[i,j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1 seconds=1
end


isless(a::Tuple{BitInteger, BitInteger}, b::Tuple{BitInteger, BitInteger}) =
    isless(a[1], b[1]) | (isequal(a[1], b[1]) & isless(a[2], b[2]))

# benchmark modified
times2 = zeros(N,N)
for (i,T1) in enumerate(packabletypes), (j,T2) in enumerate(packabletypes)
    v = zip(rand(T1,n),rand(T2,n)) |> collect;
    times2[i,j] = @belapsed sort!($v) setup=(randomTuples!($v)) evals=1 seconds=1
end

speedup = NamedArray(round.(times1./times2,sigdigits=3))
setnames!(speedup, string.(packabletypes), 1)
setnames!(speedup, string.(packabletypes), 2)
println("\nspeedup:")
speedup
speedup:
10×10 Named Matrix{Float64}
  A ╲ B │   UInt8     Int8   UInt16    Int16   UInt32    Int32   UInt64    Int64  UInt128   Int128
────────┼─────────────────────────────────────────────────────────────────────────────────────────
UInt8   │    2.11      2.1     2.24     2.41     1.79      1.9     1.34     1.31     1.27     1.27
Int8    │    2.12     2.41      2.4     2.43     1.77     1.67      1.3      1.3     1.34      1.3
UInt16  │    2.35     2.16     1.98     2.02      2.1     2.11     1.87     1.84     1.42     1.41
Int16   │     1.9     1.82     1.63     1.71     1.89     1.74     1.44     1.47     1.36     1.31
UInt32  │    1.46     1.34      1.9     2.19     1.71     1.66     1.73     1.68     1.51      1.5
Int32   │    1.32      1.4     1.89     1.76     1.69     1.55     1.37     1.78     1.58     1.52
UInt64  │    1.02     1.03     1.41     1.52     1.58     1.79     1.54      1.6     1.57     1.54
Int64   │    1.03     1.05      1.5     1.43     1.66     1.64     1.66     1.45     1.63     1.63
UInt128 │    1.08     1.12     1.36     1.42     1.45     1.44     1.58     1.55     1.47     1.41
Int128  │    1.09     1.13     1.31     1.35     1.42     1.43     1.54     1.54     1.45     1.42

@LSchwerdt (Contributor)

The equivalent would actually be isless(a::Tuple{<:Union{...}, T}, b::Tuple{<:Union{...}, T}) where T<:Union{...}, which demonstrates your point that Vararg is unclear

You are right, I totally missed that, and your solution is even more clever than I thought.

This

isless(a::Tuple{BitInteger, BitInteger}, b::Tuple{BitInteger, BitInteger}) =
    isless(a[1], b[1]) | (isequal(a[1], b[1]) & isless(a[2], b[2]))

is great because of its simplicity and generality, but on my 3900X I get slightly less performance from it than from the explicitly packed version.

speedup:
10×10 Named Matrix{Float64}
  A ╲ B │   UInt8     Int8   UInt16    Int16   UInt32    Int32   UInt64    Int64  UInt128   Int128
────────┼─────────────────────────────────────────────────────────────────────────────────────────
UInt8   │    1.62     1.64     1.57     1.63     1.48     1.46    0.923    0.918    0.977    0.969
Int8    │    1.65     1.66     1.68     1.68     1.47     1.48    0.925    0.956    0.977    0.994
UInt16  │    1.68     1.68     1.64     1.67     1.64     1.61     1.41     1.48      1.3     1.29
Int16   │    1.77     1.82     1.71     1.67     1.69     1.69     1.48     1.43     1.27     1.26
UInt32  │    1.53     1.59      1.6     1.62     1.72     1.68     1.59     1.58     1.57     1.53
Int32   │    1.59     1.64     1.62     1.61     1.74     1.67     1.64     1.63     1.55     1.57
UInt64  │     1.0     1.03     1.45     1.44     1.62     1.62     1.56     1.62     1.52     1.52
Int64   │     1.0      1.0     1.48     1.45     1.63     1.65     1.61     1.59     1.53     1.52
UInt128 │    1.02     1.03     1.35     1.33     1.55     1.52     1.46     1.46     1.36     1.36
Int128  │    0.98      1.0     1.33     1.35     1.51     1.55     1.47     1.45     1.34     1.37

This is reflected in the assembly, where

t = UInt64.((0,0))
@code_native isless(t,t)
produces this:
# %bb.0:                                # %top
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
; │┌ @ operators.jl:421 within `isless`
; ││┌ @ int.jl:487 within `<`
        movq    (%rcx), %rax
        movq    8(%rcx), %rcx
        cmpq    (%rdx), %rax
        setb    %r8b
; │└└
; │┌ @ operators.jl:133 within `isequal`
; ││┌ @ promotion.jl:499 within `==`
        sete    %r9b
; │└└
; │┌ @ operators.jl:421 within `isless`
; ││┌ @ int.jl:487 within `<`
        cmpq    8(%rdx), %rcx
        setb    %al
; │└└
; │┌ @ bool.jl:38 within `&`
        andb    %r9b, %al
; │└
; │┌ @ bool.jl:39 within `|`
        orb     %r8b, %al
; │└
        popq    %rbp
        retq
.Lfunc_end0:
        .size   julia_PR_isless_1750, .Lfunc_end0-julia_PR_isless_1750
        .cfi_endproc
; └
                                        # -- End function
while the packed version yields this:
# %bb.0:                                # %top
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
; │┌ @ d:\Julia_dev\sorting\tuplecompare\benchPR48753_5.jl:45 within `_pack_tuple`
; ││┌ @ operators.jl:882 within `widen`
; │││┌ @ number.jl:7 within `convert`
; ││││┌ @ boot.jl:790 within `UInt128`
; │││││┌ @ boot.jl:775 within `toUInt128`
        movq    (%rcx), %rax
        movq    8(%rcx), %rcx
; │└└└└└
; │┌ @ operators.jl:421 within `isless`
; ││┌ @ int.jl:487 within `<`
        cmpq    8(%rdx), %rcx
        sbbq    (%rdx), %rax
        setb    %al
; │└└
        popq    %rbp
        retq
.Lfunc_end0:
        .size   julia_PR_isless_packed_1989, .Lfunc_end0-julia_PR_isless_packed_1989
        .cfi_endproc
; └
                                        # -- End function

The hoisted version gets its speedup from using the non-short-circuiting operators, but it does not manage to emit the sbb (subtract with borrow) instruction that would eliminate the separate equality check. So for optimal performance both versions would need to be included.
But sacrificing a little bit of performance for clear and concise code is a good choice here, in my opinion.
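
(For completeness, "both versions" could look something like this sketch: the packed method where widening stays within one machine word, and the hoisted comparison as the general fallback; my_isless and _pack_tuple are illustrative, not the PR's final code.)

using Base: BitInteger, BitIntegerSmall

# Packed fastpath for small same-type tuples:
_pack_tuple((a, b)) = widen(a) << 8sizeof(b) - typemin(b) + b
my_isless(a::NTuple{2,T}, b::NTuple{2,T}) where {T<:BitIntegerSmall} =
    isless(_pack_tuple(a), _pack_tuple(b))
# Hoisted, branch-free comparison for everything else:
my_isless(a::Tuple{BitInteger, BitInteger}, b::Tuple{BitInteger, BitInteger}) =
    isless(a[1], b[1]) | (isequal(a[1], b[1]) & isless(a[2], b[2]))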

@LilithHafner (Member, Author)

But sacrificing a little bit of performance for clear and concise code is a good choice here, in my opinion.

A rare but valuable opinion in the Julia community. I concur in this case. Are there any changes you think this needs before merging?

@LSchwerdt (Contributor)

A rare but valuable opinion in the Julia community.

Oh. To me, the combination of high performance AND beautiful, simple code is the essence of Julia. There are plenty of high-performance languages with uglier code.
And it helps that I have found a better solution for my original problem of implementing a faster sortperm, one that does not need fast tuple comparisons. ;)

Are there any changes you think this needs before merging?

Everything looks fine to me.

Optimization of the general case:

If the manual precisely matches the implementation, and

In the expression a || b, the subexpression b is only evaluated if a evaluates to false.

is correct, then the compiler is never allowed to optimize the generic version. To enable that optimization,

If evaluating b is provably side-effect free, it may be evaluated in any case.

would need to be true. That may currently be the case, but I have not found documentation saying so.
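
(For concreteness, the generic comparison under discussion boils down to this short-circuiting shape; a simplified sketch, not the exact Base source:)

# The right-hand sides of || and && may only be evaluated eagerly if the
# compiler can prove doing so is side-effect free.
generic_isless(a::Tuple{Any,Any}, b::Tuple{Any,Any}) =
    isless(a[1], b[1]) || (isequal(a[1], b[1]) && isless(a[2], b[2]))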

@LilithHafner (Member, Author)

The compiler is allowed to emit any instructions it wants, so long as those instructions fulfill the observable, documented semantics (runtime does not count as observable). An extra cmpq instruction has no observable side effects, so this is allowed. Here's an example of this general principle at work: a mul instruction, though not called for in the specification, is emitted regardless:

julia> function f(a, b)
           c = 0
           for i in 1:a
               c += b
           end
           c
       end
f (generic function with 1 method)

julia> function g(a,b)
           @elapsed f(a,b)
       end
g (generic function with 1 method)

julia> g(10^10, 10^10)
1.2e-7

julia> @code_llvm f(1, 1)
;  @ REPL[18]:1 within `f`
define i64 @julia_f_1808(i64 signext %0, i64 signext %1) #0 {
top:
;  @ REPL[18]:3 within `f`
; ┌ @ range.jl:5 within `Colon`
; │┌ @ range.jl:397 within `UnitRange`
; ││┌ @ range.jl:404 within `unitrange_last`
     %.inv = icmp sgt i64 %0, 0
     %2 = mul i64 %1, %0
; └└└
  %spec.select = select i1 %.inv, i64 %2, i64 0
;  @ REPL[18]:6 within `f`
  ret i64 %spec.select
}

@LSchwerdt (Contributor)

Thanks! I know C++ works this way and its documentation is easy to find, but I could not find the corresponding documentation for Julia.
But I guess it was silly to think Julia might not work this way; that would be a total performance nightmare.

Labels: performance (Must go faster), sorting (Put things in order), potential benchmark (Could make a good benchmark in BaseBenchmarks)

Successfully merging this pull request may close these issues: 2x faster tuple sorting (for 2-Tuples of Integers)

5 participants