Plain for loop faster than @turbo #449
julia> function foo_debug(y, x, b)
           return LoopVectorization.@turbo_debug for i in axes(y, 1), c in axes(y, 2), n in axes(y, 3)
               y[i, c, n] = x[i, c, n] + b[c]
           end
       end
foo_debug (generic function with 1 method)
julia> ls = foo_debug(y,x,b);
julia> LoopVectorization.choose_order(ls)
([:c, :n, :i], :n, Symbol("##undefined##"), :i, 4, -1)

It was ordering the loops with `c` outermost, then `n`, then `i`, which is a poor match for the arrays' column-major layout: `n` has the largest stride and should be the outermost loop.
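To see the effect of the ordering in isolation, here is a plain-Julia comparison of the two orders (my own sketch, not from the thread; `order_good!` and `order_bad!` are hypothetical names). Julia arrays are column-major, so for `y[i, c, n]` the stride grows from `i` to `c` to `n`, and the largest-stride loop belongs outermost:

function order_good!(y, x, b)   # n → c → i: matches the column-major layout
    @inbounds for n in axes(y, 3), c in axes(y, 2), i in axes(y, 1)
        y[i, c, n] = x[i, c, n] + b[c]
    end
    return nothing
end

function order_bad!(y, x, b)    # c → n → i: strides across all the n-slices once per c
    @inbounds for c in axes(y, 2), n in axes(y, 3), i in axes(y, 1)
        y[i, c, n] = x[i, c, n] + b[c]
    end
    return nothing
end

Timing these two against each other isolates the ordering effect from everything else `@turbo` does.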
julia> using LinuxPerf

julia> foreachf(f::F, N::Int, arg1::A, args::Vararg{Any,K}) where {F,A,K} = foreach(_ -> Base.donotdelete(f(arg1, args...)), 1:N)
foreachf (generic function with 1 method)
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
foreachf(foo!, 10_000, y, x, b)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 6.11e+09 33.3% # 4.0 cycles per ns
└ task-clock 1.54e+09 33.3% # 1.5 s
┌ instructions 7.56e+08 66.7% # 0.1 insns per cycle
│ branch-instructions 2.38e+07 66.7% # 3.2% of insns
└ branch-misses 2.47e+04 66.7% # 0.1% of branch insns
┌ L1-dcache-load-misses 6.69e+08 33.3% # 200.9% of dcache loads
│ L1-dcache-loads 3.33e+08 33.3%
│ cache-misses 1.26e+05 33.3% # 0.0% of cache refs
└ cache-references 6.51e+08 33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
foreachf(foo2!, 10_000, y, x, b)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 6.11e+09 33.3% # 4.0 cycles per ns
└ task-clock 1.53e+09 33.3% # 1.5 s
┌ instructions 8.91e+08 66.7% # 0.1 insns per cycle
│ branch-instructions 2.60e+07 66.7% # 2.9% of insns
└ branch-misses 3.13e+04 66.7% # 0.1% of branch insns
┌ L1-dcache-load-misses 6.57e+08 33.3% # 157.4% of dcache loads
│ L1-dcache-loads 4.17e+08 33.3%
│ cache-misses 7.52e+04 33.3% # 0.0% of cache refs
└ cache-references 6.55e+08 33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

I'm running them 10_000 times to get a decent sample.
julia> @btime foo!($y, $x, $b)
149.690 μs (0 allocations: 0 bytes)
julia> @btime foo2!($y, $x, $b)
148.181 μs (0 allocations: 0 bytes)
julia> @btime foo3!($y, $x, $b)
148.989 μs (0 allocations: 0 bytes)
julia> @btime foo4!($y, $x, $b) # foo3! but with `@inbounds`
149.214 μs (0 allocations: 0 bytes)

Notice the pitiful 0.1 instructions per cycle in the perf counters above. Why so few instructions/cycle? Because in all 4 versions of the code, the CPU spends all of its time waiting on memory. Still, it's the same amount of memory being loaded, and I don't see the slowdown you reported on this computer.
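A back-of-the-envelope bandwidth estimate makes the memory-bound claim concrete (my own arithmetic; it assumes Float32 arrays of size 64×32×256, consistent with the 2.00 MiB allocations reported further down, and write-allocate stores):

elems   = 64 * 32 * 256               # elements per array
traffic = 3 * 4 * elems               # bytes per call: load x, load y (write-allocate), store y
t       = 149.7e-6                    # seconds per call, from @btime foo! above
println(traffic / t / 1e9, " GB/s")   # ≈ 42 GB/s, close to typical desktop DRAM bandwidth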
But I can reproduce the timings you reported (these runs are on an M1):

julia> @btime foo!($y, $x, $b)
133.917 μs (0 allocations: 0 bytes)
julia> @btime foo2!($y, $x, $b)
51.708 μs (0 allocations: 0 bytes)
julia> @btime foo3!($y, $x, $b)
55.083 μs (0 allocations: 0 bytes)
julia> @btime foo4!($y, $x, $b)
55.083 μs (0 allocations: 0 bytes)
julia> function foo_debug(y, x, b)
           return LoopVectorization.@turbo_debug for i in axes(y, 1), c in axes(y, 2), n in axes(y, 3)
               y[i, c, n] = x[i, c, n] + b[c]
           end
       end
foo_debug (generic function with 1 method)
julia> ls = foo_debug(y,x,b);
julia> LoopVectorization.choose_order(ls)
([:c, :n, :i], :n, Symbol("##undefined##"), :i, 4, -1)

The M1 has much higher memory bandwidth than most x64 CPUs, including mine. If I shrink the size of the arrays on my x64 computer (the first one I benchmarked above), making them 8x smaller, the code runs about 30x faster, as memory becomes less of an issue.
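(`randn_stridearray` is never defined in the visible part of the thread; judging by the `vrandn_stridearray` definition further down, it is presumably something like the following, with Base's default RNG in place of VectorizedRNG's `local_rng()`:)

using StrideArrays, Static  # StrideArray from StrideArrays.jl, static from Static.jl

randn_stridearray(size...) = StrideArray(randn(Float32, size...), static.(size))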
julia> y32 = randn_stridearray(64, 32, 32);

julia> x32 = randn_stridearray(64, 32, 32);
julia> @btime foo!($y32, $x32, $b)
5.777 μs (0 allocations: 0 bytes)
julia> @btime foo2!($y32, $x32, $b)
4.614 μs (0 allocations: 0 bytes)
julia> @btime foo3!($y32, $x32, $b)
5.007 μs (0 allocations: 0 bytes)
julia> @btime foo4!($y32, $x32, $b)
5.084 μs (0 allocations: 0 bytes)

and now I can also see on this computer that the `@turbo` version, `foo!`, is the slowest of the four. If you want to try fiddling with it:
See the 10.0 in LoopVectorization's loop-order cost model? Try making it larger. That's a coefficient on how heavily to penalize iterating over loops in a bad order. Increasing it to 50.0 seems to fix it:
julia> LoopVectorization.choose_order(ls)
([:n, :c, :i], :c, Symbol("##undefined##"), :i, 4, -1)

Now the loop order matches the arrays' memory layout, with `n` outermost. With that, I now get:

julia> @btime foo!($y32, $x32, $b)
4.620 μs (0 allocations: 0 bytes)

for the `@turbo` version. I can make a PR, or you're welcome to play with it to try and find a heuristic that works well.
BTW, have you seen VectorizedRNG.jl?

julia> vrandn_stridearray(size...) = StrideArray(randn(local_rng(), Float32, size...), static.(size))
vrandn_stridearray (generic function with 1 method)
julia> @benchmark randn_stridearray(64, 32, 256)
BenchmarkTools.Trial: 2344 samples with 1 evaluation.
Range (min … max): 1.969 ms … 3.901 ms ┊ GC (min … max): 0.00% … 44.07%
Time (median): 2.078 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.132 ms ± 267.936 μs ┊ GC (mean ± σ): 2.60% ± 7.65%
▃ ▇█▂
█▇▆████▇▆▅▅▃▆▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▅▃▄▆▅▄▆▅▃▃▁▁▁▁▃▄▇▇▆▅▅▄▆▆ █
1.97 ms Histogram: log(frequency) by time 3.54 ms <
Memory estimate: 2.00 MiB, allocs estimate: 6.
julia> @benchmark vrandn_stridearray(64, 32, 256)
BenchmarkTools.Trial: 6752 samples with 1 evaluation.
Range (min … max): 622.333 μs … 3.232 ms ┊ GC (min … max): 0.00% … 77.36%
Time (median): 708.791 μs ┊ GC (median): 0.00%
Time (mean ± σ): 739.788 μs ± 165.578 μs ┊ GC (mean ± σ): 4.59% ± 10.53%
▄ ▃█▆▂ ▁ ▁
█▅▃▃████▇▄▅▄▄▄▄▃▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▇▇▅▄▅█▆█▆▅▁▁▁▁▃▃▃▃██ █
622 μs Histogram: log(frequency) by time 1.55 ms <
Memory estimate: 2.00 MiB, allocs estimate: 10.

On my Linux x64 machine with AVX512:

julia> vrandn_stridearray(size...) = StrideArray(randn(local_rng(), Float32, size...), static.(size))
vrandn_stridearray (generic function with 1 method)
julia> @benchmark randn_stridearray(64, 32, 256)
BenchmarkTools.Trial: 2349 samples with 1 evaluation.
Range (min … max): 2.053 ms … 4.074 ms ┊ GC (min … max): 0.00% … 37.17%
Time (median): 2.092 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.126 ms ± 196.034 μs ┊ GC (mean ± σ): 1.46% ± 5.89%
█▅
▃███▇▄▄▁▄▁▃▃▃▄▆▅▁▁▁▁▁▃▄▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▃▆▅▅▅▅▅▅▄▃ █
2.05 ms Histogram: log(frequency) by time 3.29 ms <
Memory estimate: 2.00 MiB, allocs estimate: 6.
julia> @benchmark vrandn_stridearray(64, 32, 256)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 195.751 μs … 2.561 ms ┊ GC (min … max): 0.00% … 68.61%
Time (median): 202.900 μs ┊ GC (median): 0.00%
Time (mean ± σ): 235.975 μs ± 202.390 μs ┊ GC (mean ± σ): 12.61% ± 12.55%
█ ▁
██▆▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▆▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▇██ █
196 μs Histogram: log(frequency) by time 1.48 ms <
Memory estimate: 2.00 MiB, allocs estimate: 2.
Thanks for the elaborate answer! I haven't tried VectorizedRNG, but it could be extremely useful for a different project I am working on (sampling a distribution using random-walk Metropolis).
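For what it's worth, here is a minimal random-walk Metropolis sketch that batches its proposal draws through VectorizedRNG, using only the `randn(local_rng(), Float32, n)` method already shown above; the target density, names, and step size are all illustrative:

using VectorizedRNG

function rwm(logp, x0::Float32, nsteps::Int; step = 0.5f0)
    xi = randn(local_rng(), Float32, nsteps)  # all proposal increments in one vectorized batch
    chain = Vector{Float32}(undef, nsteps)
    x, lp = x0, logp(x0)
    for t in 1:nsteps
        x2 = x + step * xi[t]
        lp2 = logp(x2)
        if log(rand(Float32)) < lp2 - lp      # accept/reject (Base RNG for the uniform draw)
            x, lp = x2, lp2
        end
        chain[t] = x
    end
    return chain
end

chain = rwm(x -> -x^2 / 2, 0.0f0, 10_000)     # target: standard normal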
Using batch sizes small enough to fit in cache should help, if possible. SimpleChains could use more documentation, and the API needs work (especially around memory management), but it does have a convolutional network example that performs much better than Flux or PyTorch do on the CPU, at least for that (small) problem size.
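As a rough rule of thumb (my own arithmetic; the per-sample size matches the 64×32 feature arrays above, and the 1 MiB L2 figure is an assumption that varies by CPU):

bytes_per_sample = 64 * 32 * sizeof(Float32)   # 8 KiB of Float32 features per sample
l2_bytes  = 1 << 20                            # assume ~1 MiB of L2 per core
max_batch = l2_bytes ÷ bytes_per_sample        # ⇒ batches of ≤ 128 samples stay in L2
println("batch ≤ ", max_batch)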
I just tested it: smaller batch sizes are indeed faster! (That's exactly the opposite of the behavior I see with PyTorch on a GPU.) With SimpleChains, I also ran into an error.
Consider the following:
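(The code block from the original post isn't shown here. Inferred from the loop body in `foo_debug` above and the benchmark comment `foo4! # foo3! but with @inbounds`, the four variants were presumably along these lines; `foo2!`'s broadcast form in particular is a guess.)

using LoopVectorization

function foo!(y, x, b)                 # the @turbo version under discussion
    @turbo for i in axes(y, 1), c in axes(y, 2), n in axes(y, 3)
        y[i, c, n] = x[i, c, n] + b[c]
    end
    return nothing
end

function foo2!(y, x, b)                # presumably a broadcast version (a guess)
    y .= x .+ reshape(b, 1, :, 1)
    return nothing
end

function foo3!(y, x, b)                # a plain for loop
    for n in axes(y, 3), c in axes(y, 2), i in axes(y, 1)
        y[i, c, n] = x[i, c, n] + b[c]
    end
    return nothing
end

function foo4!(y, x, b)                # foo3! but with @inbounds
    @inbounds for n in axes(y, 3), c in axes(y, 2), i in axes(y, 1)
        y[i, c, n] = x[i, c, n] + b[c]
    end
    return nothing
end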
I would expect `foo2` to perform slightly worse than, or the same as, `foo`. However, the result is the opposite: `foo2` is noticeably faster.

What is happening here? Am I doing something wrong?

If it matters: I am running this on an Apple M1.

Edit: A plain for loop is about as fast as `foo2!`.