Performance for stride 2 #517

Open
olof3 opened this issue Dec 15, 2023 · 0 comments

olof3 commented Dec 15, 2023

Hello,

I tried to leverage the speedups from @turbo for a case with stride-2 access to the input data. This actually seemed to degrade performance compared to @inbounds, and it got even worse with @tturbo. I realize that non-contiguous data is much trickier, but I wasn't expecting this much of a degradation. I didn't manage to find much information, so perhaps it is a rare use case. Is this to be expected, or should something be done differently?

I tried to boil it down to the following MWE.

Thanks!

using LoopVectorization
using BenchmarkTools  # provides @benchmark, used below

function test_stride2_inbounds(out, x)
    @inbounds for k = 3:length(x)÷2 # @turbo does not work, unsure why
        acc = 0
        acc += x[2k-1]
        acc += x[2k-2]
        acc += x[2k-3]
        acc += x[2k-4]
        acc += x[2k-5]
        out[k] = acc
    end

    return out
end

function test_stride2_turbo(out, x)
    @turbo for k = 3:length(x)÷2 # @turbo does not work, unsure why
        acc = 0
        acc += x[2k-1]
        acc += x[2k-2]
        acc += x[2k-3]
        acc += x[2k-4]
        acc += x[2k-5]
        out[k] = acc
    end

    return out
end

function test_stride2_tturbo(out, x)
    @tturbo for k = 3:length(x)÷2 # @turbo does not work, unsure why
        acc = 0
        acc += x[2k-1]
        acc += x[2k-2]
        acc += x[2k-3]
        acc += x[2k-4]
        acc += x[2k-5]
        out[k] = acc
    end

    return out
end


function test_stride1_turbo(out, x)
    @turbo for k = 6:length(x)÷2 # start at 6 so that x[k-5] stays in bounds
        acc = 0
        acc += x[k-1]
        acc += x[k-2]
        acc += x[k-3]
        acc += x[k-4]
        acc += x[k-5]
        out[k] = acc
    end

    return out
end

function test_stride1_tturbo(out, x)
    @tturbo for k = 6:length(x)÷2 # start at 6 so that x[k-5] stays in bounds
        acc = 0
        acc += x[k-1]
        acc += x[k-2]
        acc += x[k-3]
        acc += x[k-4]
        acc += x[k-5]
        out[k] = acc
    end

    return out
end


x = rand(-1000:1000, 2^17)
out1, out2, out3, out4, out5 = (similar(x) for _=1:5)


println("# Threads = $(Threads.nthreads())")

println("\ntest_stride2_inbounds:")
display( @benchmark test_stride2_inbounds(out1, x) )
println("\ntest_stride2_turbo:")
display( @benchmark test_stride2_turbo(out2, x) )
println("\ntest_stride2_tturbo:")
display( @benchmark test_stride2_tturbo(out3, x) )
println("\ntest_stride1_turbo:")
display( @benchmark test_stride1_turbo(out4, x) )
println("\ntest_stride1_tturbo:")
display( @benchmark test_stride1_tturbo(out5, x) )

Output:

# Threads = 10

test_stride2_inbounds:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  53.300 μs … 867.500 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     56.800 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.665 μs ±  23.947 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃   ▁   █▄▁  ▁▁   ▃▇    ▄▇▂   ▂▁      ▄▁    ▁               ▂
  ██▇▅███▇▅███▆▇███▇▆███▅▇▆█████▅██▇█▆▆▅▅██▇▆▅▁█▆▅▄▆▅▄▆▄▄▁▄▄▁▅ █
  53.3 μs       Histogram: log(frequency) by time      74.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

test_stride2_turbo:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   80.900 μs …   3.360 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     110.400 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   134.626 μs ± 133.504 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █   
  ▄██▄▂▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▁▁▂▂▂▂▁▂▁▂▂▁▂▂▂▂▂▂▂▁▂▂▂▂▁▂ ▂
  80.9 μs          Histogram: frequency by time         1.01 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

test_stride2_tturbo:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  181.400 μs …  10.707 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     209.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   242.770 μs ± 203.921 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅█▅▃▃▃▂▁▁▁▁                                                  ▁
  ██████████████▇▇▇█▇▇▇▇▇▆▆▆▇▅▆▅▄▆▄▆▅▆▆▄▄▃▄▅▅▄▅▅▅▆▃▅▅▅▅▅▅▅▄▄▄▅▃ █
  181 μs        Histogram: log(frequency) by time        938 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

test_stride1_turbo:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  24.900 μs … 956.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     31.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.754 μs ±  30.490 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅▅▄▂▁     ▄█▆▇▇▆▃▂▁▆▄▂▄▃▁▁                                   ▂
  ██████▇▇▆▆████████████████▇█▇█▇▇▇▆▇▅▅▅▅▃▄▂▃▃▃▄▅▄▃▄▃▄▅▃▄▃▃▃▃▂ █
  24.9 μs       Histogram: log(frequency) by time        52 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

test_stride1_tturbo:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  12.100 μs …  2.709 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.500 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.181 μs ± 51.142 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▆▃▄▁▁▂                                                    ▂
  ███████████▇▇▆▇▇▆▆▆▅▅▅▅▅▅▅▅▆▅▅▃▄▅▄▄▅▃▅▅▅▄▄▃▁▄▅▅▅▅▅▅▅▅▅▁▃▄▄▅ █
  12.1 μs      Histogram: log(frequency) by time       132 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
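
One thing that might be worth trying, offered only as a sketch and not benchmarked here: since the stride-1 variants above vectorize well, the stride-2 kernel can be rewritten to read from two contiguous arrays holding the odd- and even-indexed elements of x. The function names below are hypothetical, and the version shown uses plain @inbounds so that it runs without extra packages; whether replacing @inbounds with @turbo on the deinterleaved loop actually helps is an assumption that would need measuring. Note that the two slices allocate copies, unlike the original kernels.

```julia
# Reference kernel: same stride-2 access pattern as test_stride2_inbounds above.
function stride2_reference!(out, x)
    @inbounds for k = 3:length(x)÷2
        out[k] = x[2k-1] + x[2k-2] + x[2k-3] + x[2k-4] + x[2k-5]
    end
    return out
end

# Hypothetical alternative: split x into its odd- and even-indexed halves once,
# so that the hot loop only touches contiguous (stride-1) memory.
function stride2_deinterleaved!(out, x)
    odd  = x[1:2:end]   # odd[j]  == x[2j-1]  (allocates a copy)
    even = x[2:2:end]   # even[j] == x[2j]    (allocates a copy)
    @inbounds for k = 3:length(x)÷2
        # x[2k-1], x[2k-2], x[2k-3], x[2k-4], x[2k-5] in terms of odd/even
        out[k] = odd[k] + even[k-1] + odd[k-1] + even[k-2] + odd[k-2]
    end
    return out
end

x = rand(-1000:1000, 2^10)
out_ref = zeros(Int, length(x))
out_alt = zeros(Int, length(x))
stride2_reference!(out_ref, x)
stride2_deinterleaved!(out_alt, x)
println(out_ref == out_alt)  # both kernels should produce identical results
```

If x is reused across many calls, the odd and even buffers could be preallocated and filled once, so that the copy cost is not paid inside the timed region.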