I tried to leverage the speedups from @turbo for a case with stride-2 access to the input data. This actually seemed to degrade performance compared to @inbounds, and it got even worse with @tturbo. I realize that non-contiguous data is much trickier to vectorize, but I wasn't expecting this much of a degradation. I didn't manage to find much information, so perhaps this is a rare use case. Is this to be expected, or should something be done differently?
I tried to boil it down to the following MWE.
Thanks!
using LoopVectorization
using BenchmarkTools  # provides @benchmark
function test_stride2_inbounds(out, x)
    @inbounds for k = 3:length(x)÷2  # baseline: @inbounds only, no @turbo
        acc = 0
        acc += x[2k-1]
        acc += x[2k-2]
        acc += x[2k-3]
        acc += x[2k-4]
        acc += x[2k-5]
        out[k] = acc
    end
    return out
end
function test_stride2_turbo(out, x)
    @turbo for k = 3:length(x)÷2  # stride-2 reads of x
        acc = 0
        acc += x[2k-1]
        acc += x[2k-2]
        acc += x[2k-3]
        acc += x[2k-4]
        acc += x[2k-5]
        out[k] = acc
    end
    return out
end
function test_stride2_tturbo(out, x)
    @tturbo for k = 3:length(x)÷2  # stride-2 reads of x, threaded
        acc = 0
        acc += x[2k-1]
        acc += x[2k-2]
        acc += x[2k-3]
        acc += x[2k-4]
        acc += x[2k-5]
        out[k] = acc
    end
    return out
end
function test_stride1_turbo(out, x)
    # Start at 6 so that x[k-5] stays in bounds; @turbo does not check bounds.
    @turbo for k = 6:length(x)÷2
        acc = 0
        acc += x[k-1]
        acc += x[k-2]
        acc += x[k-3]
        acc += x[k-4]
        acc += x[k-5]
        out[k] = acc
    end
    return out
end
function test_stride1_tturbo(out, x)
    # Start at 6 so that x[k-5] stays in bounds; @tturbo does not check bounds.
    @tturbo for k = 6:length(x)÷2
        acc = 0
        acc += x[k-1]
        acc += x[k-2]
        acc += x[k-3]
        acc += x[k-4]
        acc += x[k-5]
        out[k] = acc
    end
    return out
end
x = rand(-1000:1000, 2^17)
out1, out2, out3, out4, out5 = (similar(x) for _ = 1:5)
println("# Threads = $(Threads.nthreads())")
# Interpolate globals with $ so the benchmark does not measure global-variable access.
println("\ntest_stride2_inbounds:")
display(@benchmark test_stride2_inbounds($out1, $x))
println("\ntest_stride2_turbo:")
display(@benchmark test_stride2_turbo($out2, $x))
println("\ntest_stride2_tturbo:")
display(@benchmark test_stride2_tturbo($out3, $x))
println("\ntest_stride1_turbo:")
display(@benchmark test_stride1_turbo($out4, $x))
println("\ntest_stride1_tturbo:")
display(@benchmark test_stride1_tturbo($out5, $x))