Use optimal kernel parameters (architectures, matrix layouts) #34

SuperFluffy · 2018-12-03T14:09:56Z

I am trying to figure out what to use as optimal kernel parameter for different architectures.

For example, it looks like blis is using 8x4 for Sandy Bridge, but 8x6 for Haswell. Why? What lead them to this setup? Specifically, because operations are usually on 4 doubles at a time, how does the 6 fit in there. Is Haswell able to separately execute a _mm256 and a _mm operation at the same time?

Furthermore, if we have non-square kernels like for dgemm, is there a scenario where choosing 4x8 over 8x4 is better?

The text was updated successfully, but these errors were encountered:

bluss · 2018-12-03T17:38:40Z

You must also include src/archparam.rs in this

SuperFluffy · 2018-12-04T10:04:26Z

This discussion is revealing in terms of how to determine optimal kernel parameters: flame/blis#253

In particular, this states:

@VirtualEarth Turn your attention to Eq. 1:
  mr nr >= Nvec Lvfma Nvfma
On Intel Haswell/Broadwell/Skylake/Kabylake, Nvec is 4, Lvfma is 5, and Nvfma is 2. Thus, the register blocksize product mr nr must be at least 40 in order to overcome the floating-point latency. 6 x 8 = 48 more than satisfies this, but 8 x 6 would also be fine. The former is biased toward row-oriented SIMD output while the latter is better for column-oriented SIMD output. In BLIS, it almost never actually matters which one you pick because high-level logic will transpose the entire operation and make the output matrix C appear to be row- or column-stored, depending on the SIMD output "preference" of the microkernel.

Another interesting bit is the choice of 8x6 over 6x8 (or 8x4 over 4x8 for Sandy Bridge, which is our current implementation), which prefers column- vs row-storage in the C matrix. This then ties in with my question here: #31

mratsim · 2018-12-08T00:36:10Z

Hello there,

I'm the author of Arraymancer a Numpy + machine learning + deep learning written from scratch in Nim.

In the past month I've been building a new faster backend, with a specific focus on matrix multiplication as MKL, BLAS, BLIS were limiting my optimisation opportunities (like fusing neural network activations into GEMM or merging the im2col pass for convolution into matrix multiplication).

I've done extensive review of the literature here and added a lot of comments in my tiling code.

The most useful papers are:

[1] Anatomy of High-Performance Matrix Multiplication (Revised)
Kazushige Goto, Robert A. Van de Geijn
- http://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf

[2] Anatomy of High-Performance Many-Threaded Matrix Multiplication
Smith et al
- http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf

[3] Automating the Last-Mile for High Performance Dense Linear Algebra
Veras et al
- https://arxiv.org/pdf/1611.08035.pdf

[4] GEMM: From Pure C to SSE Optimized Micro Kernels
Michael Lehn
- http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html

I also keep some extra links that I didn't have time to sort.

Anyway, in terms of performance I have a generic multi-threaded BLAS (float32, float64, int32, int64) that reaches between 97~102% of OpenBLAS on my Broadwell (AVX2 and FMA) laptop depending if multithreaded/serial/float32 or float64:

Kernel parameters

mr * nr: 6 * 16.
- Why 16?: one-dimension must be a multiple of the vector size. With AVX it's 8. Also since CPU can issue 2 Fused-Multiply-Add in parallel (instruction-level parallelism) 16 is an easy choice.
- Why 6?: We have 16 general purposes register on SSE2~AVX2. We need to keep a pointer to packed A, packed B and the loop index over kc. So with 6 I can unroll twice, using 12 registers, and have 4 left for bookeeping.
- Why not 16x6?: C will be a MxN matrix, and very often C is a contiguous matrix that has been allocated. As by default I use row-major, 6x16 avoids transposing during the C update step and allow me to specialize when C as a unit stride along the N dimension.

Panel parameters

Proper tuning of mc and kc is very important as well.
Currently I use an arbitrary 768 bytes for mc and 2048 bytes for kc. In my testing I lost up to 35% performance with other values.

There are various constraint for both, the Goto paper goes quite in-depth into them:

micropanel of packed B of size kc * nr should remain in L1 cache, and so take less than half of it to not be evicted by other panels.
Panel of packed A of size kc * mc should take a considerable part of the L2 cache, but still stay addressable by the TLB.
I had an algorithm to choose them, but as I can only query cache but no TLB information at the moment, I removed it and decided to tune manually.

mratsim · 2018-12-08T00:59:42Z

I forgot to add. As mentioned in the paper Automating the last mile. You need to choose your SIMD for C updates.

You can go for shuffle/permute or broadcast and balance them to a varying degree.

To find the best you need to check from which port those instructions can be issued. Also interleave them with FMA to hide data fetching latencies.

In my code I use full broadcast and no shuffle/permute but mainly because it was simpler to reason about, I didn't test other config.

bluss · 2018-12-08T08:01:31Z

@mratsim Wow, cool to hear from you! Thanks for the links to the papers and for sharing your knowledge!

bluss · 2021-11-13T14:05:14Z

Issue #59 allows tweaking the NC, MC, KC variables easily at compile-time which is one small step and a model for further compile time tweakability.

ethanhs · 2021-11-25T22:05:44Z

Another idea which libsxmm uses is an autotuner, such as https://opentuner.org/. OpenTuner automatically evaluates the best parameters for the architecture the code is compiled on.

These values were taken from the Blis kernel configuration for Haswell [1], Setting NR to use 2 AVX registers takes advantage of the 2 FMA execution ports [2]. [1] https://github.com/flame/blis/blob/f956b79922da412791e4c8b8b846b3aafc0a5ee0/kernels/haswell/bli_kernels_haswell.h#L55 [2] bluss/matrixmultiply#34 (comment)

These values were taken from the Blis kernel configuration for Haswell [1], Setting NR to use 2 AVX registers takes advantage of the 2 FMA execution ports [2]. The clippy lint about the compiler optimizing away constant assertions was disabled because that is exactly the behavior that we want. See also rust-lang/rust-clippy#8159. [1] https://github.com/flame/blis/blob/f956b79922da412791e4c8b8b846b3aafc0a5ee0/kernels/haswell/bli_kernels_haswell.h#L55 [2] bluss/matrixmultiply#34 (comment)

mratsim mentioned this issue Apr 20, 2024

Provide benchmark with throughput units (GFlops/s TFlops/s) sarah-quinones/gemm#26

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use optimal kernel parameters (architectures, matrix layouts) #34

Use optimal kernel parameters (architectures, matrix layouts) #34

SuperFluffy commented Dec 3, 2018

bluss commented Dec 3, 2018 •

edited

Loading

SuperFluffy commented Dec 4, 2018

mratsim commented Dec 8, 2018

mratsim commented Dec 8, 2018

bluss commented Dec 8, 2018

bluss commented Nov 13, 2021

ethanhs commented Nov 25, 2021

Use optimal kernel parameters (architectures, matrix layouts) #34

Use optimal kernel parameters (architectures, matrix layouts) #34

Comments

SuperFluffy commented Dec 3, 2018

bluss commented Dec 3, 2018 • edited Loading

SuperFluffy commented Dec 4, 2018

mratsim commented Dec 8, 2018

Kernel parameters

Panel parameters

mratsim commented Dec 8, 2018

bluss commented Dec 8, 2018

bluss commented Nov 13, 2021

ethanhs commented Nov 25, 2021

bluss commented Dec 3, 2018 •

edited

Loading