Plain for loop faster than @turbo #449

Open
cafaxo opened this issue Dec 11, 2022 · 5 comments

cafaxo commented Dec 11, 2022

Consider the following:

using LoopVectorization, StrideArraysCore, BenchmarkTools

randn_stridearray(size...) = StrideArray(randn(Float32, size...), static.(size))

y = randn_stridearray(64, 32, 256);
x = randn_stridearray(64, 32, 256);
b = randn_stridearray(32);

function foo!(y, x, b)
    @turbo for i in axes(y, 1), c in axes(y, 2), n in axes(y, 3)
        y[i, c, n] = x[i, c, n] + b[c]
    end
    return nothing
end

function foo2!(y, x, b)
    for n in axes(y, 3)
        @turbo for i in axes(y, 1), c in axes(y, 2)
            y[i, c, n] = x[i, c, n] + b[c]
        end
    end
    return nothing
end

I would expect foo2! to perform the same as or slightly worse than foo!.
However, the result is:

julia> @btime foo!($y, $x, $b)
  153.958 μs (0 allocations: 0 bytes)

julia> @btime foo2!($y, $x, $b)
  51.583 μs (0 allocations: 0 bytes)

What is happening here? Am I doing something wrong?
If it matters: I am running this on an Apple M1.

Edit:

A plain for loop is about as fast as foo2!:

julia> function foo3!(y, x, b)
           for n in axes(y, 3)
               for c in axes(y, 2)
                   for i in axes(y, 1)
                       y[i, c, n] = x[i, c, n] + b[c]
                   end
               end
           end
           return nothing
       end
foo3! (generic function with 1 method)

julia> @btime foo3!($y, $x, $b)
  52.375 μs (0 allocations: 0 bytes)

cafaxo changed the title from "Inefficient loop order" to "Plain for loop faster than @turbo" on Dec 11, 2022

chriselrod (Member) commented:

julia> function foo_debug(y, x, b)
           return LoopVectorization.@turbo_debug for i in axes(y, 1), c in axes(y, 2), n in axes(y, 3)
               y[i, c, n] = x[i, c, n] + b[c]
           end
       end
foo_debug (generic function with 1 method)

julia> ls = foo_debug(y,x,b);

julia> LoopVectorization.choose_order(ls)
([:c, :n, :i], :n, Symbol("##undefined##"), :i, 4, -1)

It was ordering the loops c, n, i, probably to try to minimize the number of times b is reloaded.
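For intuition, that [:c, :n, :i] order corresponds roughly to nesting plain loops like this (just a sketch of the traversal order with a hypothetical name, not the code @turbo actually generates):

function foo_cni!(y, x, b)
    @inbounds for c in axes(y, 2)
        bc = b[c]                           # b is reloaded only once per outer iteration
        for n in axes(y, 3), i in axes(y, 1)
            # but consecutive n are 64*32 Float32s (8 KiB) apart in x and y
            y[i, c, n] = x[i, c, n] + bc
        end
    end
    return nothing
end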
So foo! does use fewer CPU instructions than foo2!:

julia> using LinuxPerf

julia> foreachf(f::F, N::Int, arg1::A, args::Vararg{Any,K}) where {F,A,K} = foreach(_ -> Base.donotdelete(f(arg1, args...)), 1:N)
foreachf (generic function with 1 method)

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(foo!, 10_000, y, x, b)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               6.11e+09   33.3%  #  4.0 cycles per ns
└ task-clock               1.54e+09   33.3%  #  1.5 s
┌ instructions             7.56e+08   66.7%  #  0.1 insns per cycle
│ branch-instructions      2.38e+07   66.7%  #  3.2% of insns
└ branch-misses            2.47e+04   66.7%  #  0.1% of branch insns
┌ L1-dcache-load-misses    6.69e+08   33.3%  # 200.9% of dcache loads
│ L1-dcache-loads          3.33e+08   33.3%
│ cache-misses             1.26e+05   33.3%  #  0.0% of cache refs
└ cache-references         6.51e+08   33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(foo2!, 10_000, y, x, b)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               6.11e+09   33.3%  #  4.0 cycles per ns
└ task-clock               1.53e+09   33.3%  #  1.5 s
┌ instructions             8.91e+08   66.7%  #  0.1 insns per cycle
│ branch-instructions      2.60e+07   66.7%  #  2.9% of insns
└ branch-misses            3.13e+04   66.7%  #  0.1% of branch insns
┌ L1-dcache-load-misses    6.57e+08   33.3%  # 157.4% of dcache loads
│ L1-dcache-loads          4.17e+08   33.3%
│ cache-misses             7.52e+04   33.3%  #  0.0% of cache refs
└ cache-references         6.55e+08   33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

I'm running them 10_000 times to get a decent sample.
Note that foo2! took about 8.91e4 instructions per call (the 8.91e8 total divided by 10_000 calls), while foo! required only about 7.56e4.

@turbo doesn't seem to be properly accounting for the fact that traversing the memory of y[i, c, n] and x[i, c, n] in c, n, i order is potentially more costly than in n, c, i order.
"Potentially" because, FWIW, on this machine all four versions time about the same:

julia> @btime foo!($y, $x, $b)
  149.690 μs (0 allocations: 0 bytes)

julia> @btime foo2!($y, $x, $b)
  148.181 μs (0 allocations: 0 bytes)

julia> @btime foo3!($y, $x, $b)
  148.989 μs (0 allocations: 0 bytes)

julia> @btime foo4!($y, $x, $b) # foo3! but with `@inbounds`
  149.214 μs (0 allocations: 0 bytes)

Notice the pitiful 0.1 insns per cycle in the @pstats reports above. That means the CPU is only executing one instruction every 10 clock cycles. This is really bad: this CPU is theoretically capable of up to 4 instructions/cycle, 40x higher.

Why so few instructions/cycle? Because in all 4 versions of the code, the CPU spends all of its time waiting on memory.
Memory bandwidth is the bottleneck, so what the CPU is doing/how much work it has to do is totally irrelevant.

Still, it's the same amount of memory being loaded, and I don't see that foo! (using an order meant to minimize the number of instructions needed) is any worse than the others (which traverse in a theoretically nicer order).
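A back-of-envelope check of that (rough numbers derived from the timings above, not a measurement):

bytes_per_array = 64 * 32 * 256 * sizeof(Float32)  # ≈ 2.1e6 bytes (2 MiB) for each of x and y
traffic = 2 * bytes_per_array                      # read x + write y, ignoring b and write-allocate
traffic / 150e-6 / 1e9                             # ≈ 28 GB/s at the ~150 μs per call on this x64 machine
traffic / 52e-6 / 1e9                              # ≈ 81 GB/s at the ~52 μs per call reported on the M1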

Unfortunately, LinuxPerf is Linux-only, so I can't collect those counters on the M1.

But on the M1 I can reproduce the timings you reported:

julia> @btime foo!($y, $x, $b)
  133.917 μs (0 allocations: 0 bytes)

julia> @btime foo2!($y, $x, $b)
  51.708 μs (0 allocations: 0 bytes)

julia> @btime foo3!($y, $x, $b)
  55.083 μs (0 allocations: 0 bytes)

julia> @btime foo4!($y, $x, $b)
  55.083 μs (0 allocations: 0 bytes)

julia> function foo_debug(y, x, b)
           return LoopVectorization.@turbo_debug for i in axes(y, 1), c in axes(y, 2), n in axes(y, 3)
               y[i, c, n] = x[i, c, n] + b[c]
           end
       end
foo_debug (generic function with 1 method)

julia> ls = foo_debug(y,x,b);

julia> LoopVectorization.choose_order(ls)
([:c, :n, :i], :n, Symbol("##undefined##"), :i, 4, -1)

The M1 has much higher memory bandwidth than most x64 CPUs, including mine.
Hence it's able to do much better here, at least given the nicer order.

If I shrink the size of the arrays on my x64 computer (the first one I benchmarked above), making them 8x smaller, the code runs about 30x faster, as memory becomes less of an issue:

julia> y32 = randn_stridearray(64, 32, 32);

julia> x32 = randn_stridearray(64, 32, 32);

julia> @btime foo!($y32, $x32, $b)
  5.777 μs (0 allocations: 0 bytes)

julia> @btime foo2!($y32, $x32, $b)
  4.614 μs (0 allocations: 0 bytes)

julia> @btime foo3!($y32, $x32, $b)
  5.007 μs (0 allocations: 0 bytes)

julia> @btime foo4!($y32, $x32, $b)
  5.084 μs (0 allocations: 0 bytes)

and now I can also see on this computer that foo! seems to be worse than foo2!.
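
To put rough numbers on the working sets involved (a sketch; where these fit in the cache hierarchy depends on the CPU):

full  = 2 * 64 * 32 * 256 * sizeof(Float32)   # x + y at the original size
small = 2 * 64 * 32 *  32 * sizeof(Float32)   # x + y after shrinking 8x
(full, small) .÷ 1024                          # (4096, 512) KiB touched per call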

If you want to try fiddling with it, the relevant line in LoopVectorization's source is:

10.0sum(maximum, values(stridepenaltydict)) *

See the 10.0 there? Try making it larger. It's a coefficient on how heavily to penalize iterating over loops in a bad order. Increasing it to 50.0 seems to fix it:

julia> LoopVectorization.choose_order(ls)
([:n, :c, :i], :c, Symbol("##undefined##"), :i, 4, -1)

Now n would be the outermost loop. You'd have to force recompilation after making a change like that (e.g. by restarting Julia, or by editing _turbo_!, such as commenting a useless line like 1+1 in or out).

With that, I now get:

julia> @btime foo!($y32, $x32, $b)
  4.620 μs (0 allocations: 0 bytes)

for the size(..., 3) == 32 example.

I can make a PR, or you're welcome to play with it to try to find a heuristic that works well.

chriselrod (Member) commented:

BTW, have you seen VectorizedRNG.jl?
On the M1:

julia> using VectorizedRNG

julia> vrandn_stridearray(size...) = StrideArray(randn(local_rng(), Float32, size...), static.(size))
vrandn_stridearray (generic function with 1 method)

julia> @benchmark randn_stridearray(64, 32, 256)
BenchmarkTools.Trial: 2344 samples with 1 evaluation.
 Range (min … max):  1.969 ms …   3.901 ms  ┊ GC (min … max): 0.00% … 44.07%
 Time  (median):     2.078 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.132 ms ± 267.936 μs  ┊ GC (mean ± σ):  2.60% ±  7.65%

  ▃  ▇█▂
  █▇▆████▇▆▅▅▃▆▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▅▃▄▆▅▄▆▅▃▃▁▁▁▁▃▄▇▇▆▅▅▄▆▆ █
  1.97 ms      Histogram: log(frequency) by time      3.54 ms <

 Memory estimate: 2.00 MiB, allocs estimate: 6.

julia> @benchmark vrandn_stridearray(64, 32, 256)
BenchmarkTools.Trial: 6752 samples with 1 evaluation.
 Range (min … max):  622.333 μs …   3.232 ms  ┊ GC (min … max): 0.00% … 77.36%
 Time  (median):     708.791 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   739.788 μs ± 165.578 μs  ┊ GC (mean ± σ):  4.59% ± 10.53%

  ▄   ▃█▆▂                                                    ▁ ▁
  █▅▃▃████▇▄▅▄▄▄▄▃▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▇▇▅▄▅█▆█▆▅▁▁▁▁▃▃▃▃██ █
  622 μs        Histogram: log(frequency) by time       1.55 ms <

 Memory estimate: 2.00 MiB, allocs estimate: 10.

On my Linux x64 machine with AVX512:

julia> vrandn_stridearray(size...) = StrideArray(randn(local_rng(), Float32, size...), static.(size))
vrandn_stridearray (generic function with 1 method)

julia> @benchmark randn_stridearray(64, 32, 256)
BenchmarkTools.Trial: 2349 samples with 1 evaluation.
 Range (min … max):  2.053 ms …   4.074 ms  ┊ GC (min … max): 0.00% … 37.17%
 Time  (median):     2.092 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.126 ms ± 196.034 μs  ┊ GC (mean ± σ):  1.46% ±  5.89%

   █▅
  ▃███▇▄▄▁▄▁▃▃▃▄▆▅▁▁▁▁▁▃▄▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▃▆▅▅▅▅▅▅▄▃ █
  2.05 ms      Histogram: log(frequency) by time      3.29 ms <

 Memory estimate: 2.00 MiB, allocs estimate: 6.

julia> @benchmark vrandn_stridearray(64, 32, 256)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  195.751 μs …   2.561 ms  ┊ GC (min … max):  0.00% … 68.61%
 Time  (median):     202.900 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   235.975 μs ± 202.390 μs  ┊ GC (mean ± σ):  12.61% ± 12.55%

  █                                                             ▁
  ██▆▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▆▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▇██ █
  196 μs        Histogram: log(frequency) by time       1.48 ms <

 Memory estimate: 2.00 MiB, allocs estimate: 2.

cafaxo (Author) commented Dec 11, 2022

Thanks for the detailed answer!
I'll try playing with that constant.
But yeah, I am trying to train a convolutional neural network for chess on the CPU, and memory seems to be quite the bottleneck. I'll have to check with perf on my Linux machine.

I haven't tried VectorizedRNG, but it could be extremely useful for a different project I am working on (sampling a distribution using random-walk Metropolis).

chriselrod (Member) commented Dec 12, 2022

But yeah, I am trying to train a convolutional neural network for chess on the CPU and memory seems to be quite the bottleneck. I'll have to check with perf on my linux machine.

Using batch sizes small enough to fit in cache should help, if possible.
Have you looked at SimpleChains.jl at all?

It could use more documentation and the API needs work (especially around memory management), but it does have a convolutional network example that performs much better than Flux or PyTorch do on the CPU, at least for that (small) problem size.

cafaxo (Author) commented Dec 12, 2022

I just tested it: smaller batch sizes are indeed faster! (That's exactly the opposite of the behavior I see with pytorch+GPU.)
I first looked at SimpleChains.jl, and its performance convinced me that for my use case I could reach better training speed on the M1's CPU than with pytorch's GPU backend.
However, I needed features such as zero-padding and residual connections, which SimpleChains does not support.
Some parts of SimpleChains also seemed somewhat convoluted (pun intended). You seem to be doing some nontrivial stuff for memory management, right? In any case, to get a better intuition for what the backward passes look like for my network, I decided to roll my own forward+backward passes on top of StrideArrays+LoopVectorization.

With SimpleChains, I also ran into this error:
https://julialang.slack.com/archives/C690QRAA3/p1668460255442179
Should I open an issue for that?
Also, by the way: the way the transpose for 2d convolution is implemented in SimpleChains (manually checking bounds in the inner for loop) is super slow on my machine. I am getting much better performance by first copying into a preallocated zero-padded array and then doing the convolution without any bounds checks, roughly as sketched below.
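
The zero-padded-copy idea, as a minimal sketch (hypothetical names and a generic (W, H, C, N) layout, not my actual code): keep a preallocated buffer whose border is zero and copy the input into its interior before each convolution, so the kernel itself never touches out-of-bounds indices.

function copy_into_padded!(xpad, x, p)
    # xpad is preallocated with size (W + 2p, H + 2p, C, N) and its border already zeroed
    W, H = size(x, 1), size(x, 2)
    @views xpad[p+1:W+p, p+1:H+p, :, :] .= x   # fill the interior; the zero border is untouched
    return xpad
end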
