Understanding the ConvLayer Schedule #8196
-
I would like to understand why the choice of split dimensions in the ConvLayer example performs so well. Consider the following section from the CPU schedule:
The schedule splits the loop over the output channels, with an inner loop of size vec * tile_w. It splits the loop over the width dimension with an inner loop of size tile_h. Finally, the inner loop ci gets vectorized at the native vector size. I would instead have scheduled the following:
I realize that my choice leads to slower computation. Is there any intuition for why splitting over the channel and width dimensions, with inner loops whose tile sizes are named after a different dimension of the matrix, is a good idea?
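For concreteness, the loop nest produced by the splits described above might look roughly like this (a hypothetical Python sketch, not the actual Halide-generated code; `vec`, `tile_w`, `tile_h`, `C`, `W` are illustrative values I chose, assuming exact divisibility):

```python
# Hypothetical sketch of the loop structure after the two splits
# (illustrative only -- the real schedule is Halide C++, not Python).
vec, tile_w, tile_h = 4, 2, 2
C, W = vec * tile_w * 2, tile_h * 3          # channel / width extents (assumed divisible)

visited = []
for co in range(C // (vec * tile_w)):        # split c: outer channel loop
    for xo in range(W // tile_h):            # split x: outer width loop
        for xi in range(tile_h):             # inner width loop of size tile_h
            for ci in range(vec * tile_w):   # inner channel loop of size vec * tile_w;
                                             # the innermost vec lanes get vectorized
                visited.append((co * vec * tile_w + ci, xo * tile_h + xi))

# Sanity check: every (c, x) point is computed exactly once.
assert len(visited) == C * W
assert len(set(visited)) == C * W
```

The point of the structure is that the two inner loops (`ci` and `xi`) together form a small tile of output points that can stay resident in registers.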
-
Consider a matrix multiply sum_j A_ij * B_jk. Note that the value of B you use in the inner loop doesn't depend on i, so you want to unroll over i so that you can load a value of B once and then use it multiple times. Similarly, the value of A you use doesn't depend on k, so you also want to unroll over k a bit for the same reason. For an n x m tile, this means you do O(n + m) loads and O(n * m) FMAs. Many other sane-looking schedules do two loads per FMA. You want n and m to be as large as possible to maximize the number of FMAs per load. You stop when you either run out of registers and start spilling to the stack (introducing more loads and stores), or when you become limited by the number of FMAs (i.e. you hit peak flops, or close enough). So in short: for a matrix multiply you tile and unroll over i and k, and this is the well-known standard way to implement a matrix multiply. A conv layer is basically a matrix multiply with extra indexing steps, and if you follow through the indexing differences, the above observation implies you should tile over c and then either x or y.
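The load/FMA accounting above can be made concrete with a small sketch (illustrative Python, not the actual Halide microkernel; `tiled_matmul`, `n`, and `m` are names I made up). Each step of the reduction loop over j loads n values of A and m values of B, then feeds all n * m accumulators:

```python
# Hypothetical register-tiled matmul C[i][k] = sum_j A[i][j] * B[j][k].
# Counts loads and multiply-adds to show the O(n + m) loads per
# O(n * m) FMAs ratio described above.
def tiled_matmul(A, B, n=2, m=2):
    I, J, K = len(A), len(A[0]), len(B[0])   # assumes I % n == 0 and K % m == 0
    C = [[0.0] * K for _ in range(I)]
    loads = fmas = 0
    for i0 in range(0, I, n):
        for k0 in range(0, K, m):
            acc = [[0.0] * m for _ in range(n)]        # n*m accumulators (in registers)
            for j in range(J):
                a = [A[i0 + di][j] for di in range(n)] # n loads of A (reused across k)
                b = [B[j][k0 + dk] for dk in range(m)] # m loads of B (reused across i)
                loads += n + m
                for di in range(n):
                    for dk in range(m):
                        acc[di][dk] += a[di] * b[dk]   # n*m FMAs per j-step
                        fmas += 1
            for di in range(n):
                for dk in range(m):
                    C[i0 + di][k0 + dk] = acc[di][dk]
    return C, loads, fmas
```

With n = m = 2 the ratio is one load per FMA; growing the tile to 4 x 4 halves the loads per FMA, which is exactly why you want the tile as large as the register file allows.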