Understanding the ConvLayer Schedule #8196
-
I would like to understand why the choice of split dimensions in the ConvLayer example performs so well. Consider the following section from the CPU schedule:
The schedule splits the loop over the output channels, with an inner loop of size vec * tile_w. It splits the loop over the width dimension with an inner loop of size tile_h. Finally, the inner loop ci gets vectorized at the native vector size. I would instead have scheduled the following:
I realize that my choice leads to slower computation. Is there any intuition for why splitting over the channel and width dimensions, with inner loops whose tile sizes are named after a different dimension of the matrix, is a good idea?
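For concreteness, the loop nest produced by the splits described above might look roughly like this (a hypothetical Python sketch, not the actual Halide-generated code; `vec`, `tile_w`, `tile_h`, `C`, `W` are illustrative values I chose, assuming exact divisibility):

```python
# Hypothetical sketch of the loop structure after the two splits
# (illustrative only -- the real schedule is Halide C++, not Python).
vec, tile_w, tile_h = 4, 2, 2
C, W = vec * tile_w * 2, tile_h * 3          # channel / width extents (assumed divisible)

visited = []
for co in range(C // (vec * tile_w)):        # split c: outer channel loop
    for xo in range(W // tile_h):            # split x: outer width loop
        for xi in range(tile_h):             # inner width loop of size tile_h
            for ci in range(vec * tile_w):   # inner channel loop of size vec * tile_w;
                                             # the innermost vec lanes get vectorized
                visited.append((co * vec * tile_w + ci, xo * tile_h + xi))

# Sanity check: every (c, x) point is computed exactly once.
assert len(visited) == C * W
assert len(set(visited)) == C * W
```

The point of the structure is that the two inner loops (`ci` and `xi`) together form a small tile of output points that can stay resident in registers.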
-
Consider a matrix multiply sum_j A_ij * B_jk. Note that the value of B you use in the inner loop doesn't depend on i, so you want to unroll over i so that you can load a value of B once and then use it multiple times. Similarly, the value of A you use doesn't depend on k, so you also want to unroll over k a bit for the same reason. For an n x m tile, this means you do O(n + m) loads and O(n * m) FMAs. Many other sane-looking schedules do two loads per FMA. You want n and m to be as large as possible to maximize the number of FMAs per load. You stop when you either run out of registers and start spilling to the stack (introducing more loads and stores), or when you become limited by the number of FMAs (i.e. you hit peak flops, or close enough). So in short: for a matrix multiply you tile and unroll over i and k, and this is the well-known standard way to implement a matrix multiply. A conv layer is basically a matrix multiply with extra indexing steps, and if you follow through the indexing differences, the above observation implies you should tile over c and then either x or y.
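The load/FMA accounting above can be made concrete with a small sketch (illustrative Python, not the actual Halide microkernel; `tiled_matmul`, `n`, and `m` are names I made up). Each step of the reduction loop over j loads n values of A and m values of B, then feeds all n * m accumulators:

```python
# Hypothetical register-tiled matmul C[i][k] = sum_j A[i][j] * B[j][k].
# Counts loads and multiply-adds to show the O(n + m) loads per
# O(n * m) FMAs ratio described above.
def tiled_matmul(A, B, n=2, m=2):
    I, J, K = len(A), len(A[0]), len(B[0])   # assumes I % n == 0 and K % m == 0
    C = [[0.0] * K for _ in range(I)]
    loads = fmas = 0
    for i0 in range(0, I, n):
        for k0 in range(0, K, m):
            acc = [[0.0] * m for _ in range(n)]        # n*m accumulators (in registers)
            for j in range(J):
                a = [A[i0 + di][j] for di in range(n)] # n loads of A (reused across k)
                b = [B[j][k0 + dk] for dk in range(m)] # m loads of B (reused across i)
                loads += n + m
                for di in range(n):
                    for dk in range(m):
                        acc[di][dk] += a[di] * b[dk]   # n*m FMAs per j-step
                        fmas += 1
            for di in range(n):
                for dk in range(m):
                    C[i0 + di][k0 + dk] = acc[di][dk]
    return C, loads, fmas
```

With n = m = 2 the ratio is one load per FMA; growing the tile to 4 x 4 halves the loads per FMA, which is exactly why you want the tile as large as the register file allows.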