Use optimal kernel parameters (architectures, matrix layouts) #34
Comments
You must also include src/archparam.rs in this
This discussion is revealing in terms of how to determine optimal kernel parameters: flame/blis#253. In particular, it states:
Another interesting bit is the choice of 8x6 over 6x8 (or 8x4 over 4x8 for Sandy Bridge, which is our current implementation), which prefers column- vs row-storage in the C matrix. This ties in with my question here: #31
Hello there, I'm the author of Arraymancer, a NumPy-like library with machine learning and deep learning, written from scratch in Nim. In the past month I've been building a new, faster backend with a specific focus on matrix multiplication, as MKL, BLAS, and BLIS were limiting my optimisation opportunities (such as fusing neural network activations into GEMM or merging the im2col pass for convolution into matrix multiplication). I've done an extensive review of the literature here and added a lot of comments in my tiling code. The most useful papers are:
I also keep some extra links that I didn't have time to sort. Anyway, in terms of performance I have a generic multi-threaded BLAS (float32, float64, int32, int64) that reaches between 97% and 102% of OpenBLAS on my Broadwell (AVX2 and FMA) laptop, depending on whether it runs multithreaded or serial and on float32 or float64:
Kernel parameters
Panel parameters
Proper tuning of mc and kc is very important as well. There are various constraints for both; the Goto paper goes quite in-depth into them:
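As a rough illustration of the kind of constraint involved, here is a minimal sketch that derives kc and mc from cache sizes. It is not taken from the Goto paper or from any library: the helper names `pick_kc` and `pick_mc`, the "use half of the cache" budget, and the rounding rules are all assumptions made for this example. The idea it encodes is that the mr x kc and kc x nr micro-panels should stay in L1 while the mc x kc block of A stays in L2.

```rust
/// Hypothetical helper: the largest kc (rounded down to a multiple of 4) such
/// that one mr x kc micro-panel of A plus one kc x nr micro-panel of B fit in
/// half of L1. The "half" budget is an assumption, not a measured optimum.
fn pick_kc(l1_bytes: usize, elem_bytes: usize, mr: usize, nr: usize) -> usize {
    let budget_elems = l1_bytes / 2 / elem_bytes;
    let kc = budget_elems / (mr + nr);
    kc / 4 * 4
}

/// Hypothetical helper: the largest mc (rounded down to a multiple of mr) such
/// that the packed mc x kc block of A fits in half of L2.
fn pick_mc(l2_bytes: usize, elem_bytes: usize, kc: usize, mr: usize) -> usize {
    assert!(kc > 0 && mr > 0, "kc and mr must be non-zero");
    let budget_elems = l2_bytes / 2 / elem_bytes;
    let mc = budget_elems / kc;
    mc / mr * mr
}
```

For example, `pick_kc(32 * 1024, 8, 6, 8)` followed by `pick_mc(256 * 1024, 8, kc, 6)` gives a starting point for f64 on a machine with 32 KiB of L1 and 256 KiB of L2. Values like these are only a first guess to benchmark against, not a replacement for tuning.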
I forgot to add: as mentioned in the paper Automating the last mile, you need to choose your SIMD instructions for the C updates. You can go for shuffle/permute or broadcast and balance them to varying degrees. To find the best mix, you need to check which ports those instructions can be issued from. Also interleave them with FMAs to hide data-fetch latencies. In my code I use full broadcast and no shuffle/permute, mainly because it was simpler to reason about; I didn't test other configurations.
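To make the "full broadcast" structure concrete, here is a minimal portable sketch in Rust. It is not the Arraymancer or matrixmultiply kernel: the function name `microkernel_broadcast`, the 8x4 shape, and the packed layouts are assumptions chosen for illustration, and it uses plain arrays with `f64::mul_add` instead of intrinsics. Each inner `i` loop stands in for one vector FMA, and `b_bcast` stands in for a broadcast of a single B element.

```rust
const MR: usize = 8; // rows of the micro-tile (the vectorised dimension here)
const NR: usize = 4; // columns of the micro-tile

/// Computes acc[j][i] = sum over p of a[p * MR + i] * b[p * NR + j],
/// where `a` is a packed MR x k micro-panel and `b` a packed k x NR one.
fn microkernel_broadcast(k: usize, a: &[f64], b: &[f64]) -> [[f64; MR]; NR] {
    assert!(a.len() >= k * MR && b.len() >= k * NR);
    let mut acc = [[0.0f64; MR]; NR];
    for p in 0..k {
        // One "column" of A: in a SIMD kernel this is MR / lanes vector loads.
        let a_col = &a[p * MR..(p + 1) * MR];
        let b_row = &b[p * NR..(p + 1) * NR];
        for j in 0..NR {
            // Broadcast one element of B across all lanes.
            let b_bcast = b_row[j];
            // Fused multiply-add into column j of the accumulator tile.
            for i in 0..MR {
                acc[j][i] = a_col[i].mul_add(b_bcast, acc[j][i]);
            }
        }
    }
    acc
}
```

A shuffle/permute variant would instead load a whole vector of B and rotate its lanes between FMAs, trading broadcast loads for shuffle instructions, which is where the port balance mentioned above comes in.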
@mratsim Wow, cool to hear from you! Thanks for the links to the papers and for sharing your knowledge!
Issue #59 allows tweaking the NC, MC, and KC variables easily at compile time, which is one small step and a model for further compile-time tweakability.
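One possible shape for compile-time tweakable blocking parameters is sketched below. This is not necessarily how #59 does it: the environment variable names (`DEMO_GEMM_MC` and so on), the helper `parse_or`, and the default values are all hypothetical. The sketch reads optional overrides with `option_env!`, parses them in a `const fn`, and checks invariants with constant assertions so that a bad override fails the build.

```rust
/// Hypothetical helper: parse a decimal override at compile time,
/// falling back to `default` when the variable is not set.
const fn parse_or(s: Option<&str>, default: usize) -> usize {
    match s {
        None => default,
        Some(text) => {
            let bytes = text.as_bytes();
            assert!(!bytes.is_empty(), "override must not be empty");
            let mut value = 0usize;
            let mut i = 0;
            while i < bytes.len() {
                let d = bytes[i];
                assert!(d >= b'0' && d <= b'9', "override must be decimal digits");
                value = value * 10 + (d - b'0') as usize;
                i += 1;
            }
            value
        }
    }
}

pub const MR: usize = 8;
pub const NR: usize = 4;
// The variable names below are made up for this example.
pub const MC: usize = parse_or(option_env!("DEMO_GEMM_MC"), 64);
pub const KC: usize = parse_or(option_env!("DEMO_GEMM_KC"), 256);
pub const NC: usize = parse_or(option_env!("DEMO_GEMM_NC"), 1024);

// Constant assertions: evaluated by the compiler, no runtime cost.
const _: () = assert!(MC % MR == 0, "MC must be a multiple of MR");
const _: () = assert!(NC % NR == 0, "NC must be a multiple of NR");
const _: () = assert!(KC > 0, "KC must be non-zero");
```

With this layout, building with something like `DEMO_GEMM_KC=384 cargo build` would change KC without touching the source, and an override that breaks the MR/NR divisibility rules would be rejected at compile time.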
Another idea, which libxsmm uses, is an autotuner such as https://opentuner.org/. OpenTuner automatically searches for the best parameters for the architecture the code is compiled on.
These values were taken from the BLIS kernel configuration for Haswell [1]. Setting NR to use 2 AVX registers takes advantage of the 2 FMA execution ports [2]. The clippy lint about the compiler optimizing away constant assertions was disabled because that is exactly the behavior that we want. See also rust-lang/rust-clippy#8159. [1] https://github.com/flame/blis/blob/f956b79922da412791e4c8b8b846b3aafc0a5ee0/kernels/haswell/bli_kernels_haswell.h#L55 [2] bluss/matrixmultiply#34 (comment)
I am trying to figure out what to use as optimal kernel parameter for different architectures.
For example, it looks like BLIS is using 8x4 for Sandy Bridge, but 8x6 for Haswell. Why? What led them to this setup? Specifically, because operations are usually on 4 doubles at a time, how does the 6 fit in there? Is Haswell able to separately execute a _mm256 and a _mm operation at the same time?
Furthermore, if we have non-square kernels like for dgemm, is there a scenario where choosing 4x8 over 8x4 is better?
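One way to look at the "how does the 6 fit" question is as a register budget rather than a vector width: only one dimension of the tile has to be a multiple of the 4 doubles in a 256-bit register, and the other dimension just counts how many accumulator registers are used. The sketch below is back-of-the-envelope arithmetic under stated assumptions, not a description of the BLIS kernel: the helper `registers_needed` is hypothetical, and it assumes a broadcast-style kernel that keeps one column of A in registers plus a single register for the broadcast B element. The only hard fact used is that x86-64 AVX/AVX2 exposes 16 ymm registers.

```rust
/// ymm registers available on x86-64 with AVX/AVX2.
const YMM_REGISTERS: usize = 16;

/// Hypothetical estimate of the registers a broadcast-style micro-kernel uses:
/// (mr / lanes) * nr accumulators, mr / lanes registers holding a column of A,
/// and one register for the broadcast element of B.
fn registers_needed(mr: usize, nr: usize, lanes: usize) -> usize {
    assert!(lanes > 0 && mr % lanes == 0, "mr must be a multiple of the vector width");
    let a_vecs = mr / lanes;
    let accumulators = a_vecs * nr;
    let b_broadcast = 1;
    accumulators + a_vecs + b_broadcast
}

/// True if the estimated register use fits in the ymm register file.
fn fits_in_ymm(mr: usize, nr: usize, lanes: usize) -> bool {
    registers_needed(mr, nr, lanes) <= YMM_REGISTERS
}
```

Under these assumptions an 8x6 f64 tile needs 12 accumulators plus 3 working registers, which fits in 16, while 8x8 would not. Whether the tile is described as 8x6 or 6x8 then comes down to which dimension is vectorised, which is the row- versus column-storage preference for C.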