WIP: i32 gemm experiment #28
I was trying the … This is on my laptop though, and the CPU might get downclocked quite badly when running AVX2 instructions. I am going to check the performance on Amazon EC2.
There are no 64-bit multiplies in AVX SIMD, so don't try that. i32 will autovectorize on AVX, so it's a bit fun.
Results show the same information we know from packing in the other implementations (the packing is all shared code and doesn't care about the element, only element size and kernel size):

```
running 8 tests
test layout_i32_032::ccc ... bench: 6,012 ns/iter (+/- 80)
test layout_i32_032::ccf ... bench: 6,010 ns/iter (+/- 68)
test layout_i32_032::cfc ... bench: 6,244 ns/iter (+/- 89)
test layout_i32_032::cff ... bench: 6,257 ns/iter (+/- 240)
test layout_i32_032::fcc ... bench: 5,806 ns/iter (+/- 320)
test layout_i32_032::fcf ... bench: 5,767 ns/iter (+/- 2,013)
test layout_i32_032::ffc ... bench: 6,005 ns/iter (+/- 187)
test layout_i32_032::fff ... bench: 6,031 ns/iter (+/- 187)
```
I can't reproduce the slowdown, likely because I don't have AVX2 on this machine, only AVX.
What needs to be answered is: does it make sense to provide this with wrapping semantics? Saturating semantics is not an alternative unless we find a SIMD implementation of it. The wrapping implementation is at least reasonably fast and already autovectorizes.
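As an illustration of the wrapping route, here is a minimal sketch (my own, not code from this PR) of a plain i32 multiply-accumulate written with `wrapping_mul`/`wrapping_add`; loops of this shape are what LLVM can autovectorize to `vpmulld`/`vpaddd` on AVX2, with overflow simply wrapping around.

```rust
/// Sketch only: a wrapping i32 dot product over two slices.
/// Plain integer multiply-accumulate like this can be autovectorized;
/// overflow wraps modulo 2^32 instead of saturating or panicking.
fn dot_wrapping_i32(a: &[i32], b: &[i32]) -> i32 {
    debug_assert_eq!(a.len(), b.len());
    a.iter()
        .zip(b)
        .fold(0i32, |acc, (&x, &y)| acc.wrapping_add(x.wrapping_mul(y)))
}

fn main() {
    let a = [1, -2, 3, 4];
    let b = [5, 6, -7, 8];
    // 1*5 + (-2)*6 + 3*(-7) + 4*8 = 5 - 12 - 21 + 32 = 4
    assert_eq!(dot_wrapping_i32(&a, &b), 4);
}
```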
Assuming you are fine with …

For multiplication we have: …

In my mind, if we restrict ourselves to …

Since we know that we only use the first half of the …

So I think we should: …
For my personal use case, I am only interested in multiplying integer matrices consisting of … Similar add and mul instructions are also available for … Given that we then know that during the multiplication phase we won't have overflow, I'd prefer to have saturating semantics, because they feel less “wrong” to me.
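To make the overflow argument concrete (my own example; the element type elided above is assumed here to be i16 widened into i32 accumulators): the largest-magnitude product of two i16 values is 32768 · 32768 = 2^30, which fits comfortably in an i32, so the multiplication itself cannot overflow; only the accumulation can, and that is exactly where wrapping and saturating semantics diverge.

```rust
fn main() {
    // Worst case for an i16 * i16 product: (-32768) * (-32768) = 2^30,
    // which still fits in i32 (i32::MAX is 2^31 - 1), so the multiply
    // cannot overflow once the operands are widened to i32.
    let p = (i16::MIN as i32) * (i16::MIN as i32);
    assert_eq!(p, 1 << 30);

    // Accumulation is where the two semantics differ:
    let acc = i32::MAX;
    assert_eq!(acc.wrapping_add(p), i32::MIN + (1 << 30) - 1); // wraps around
    assert_eq!(acc.saturating_add(p), i32::MAX);               // clamps at MAX
}
```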
Hey! I'm currently benchmarking Facebook's PyTorch Glow matrix multiplication (pytorch/glow#1749 (comment)) and I noticed that they have an i8 GEMM implementation here. Regarding the slowdown, I think it's normal for integer GEMM; I still have to benchmark MKL integer GEMM, but I don't expect much. This is my single-threaded performance in Laser.
My kernel config is the following:

```nim
template int32x8_muladd_unfused_avx2(a, b, c: m256i): m256i =
  mm256_add_epi32(mm256_mullo_epi32(a, b), c)

ukernel_generator(
  x86_AVX2,
  typ = int32,
  vectype = m256i,
  nb_scalars = 8,
  simd_setZero = mm256_setzero_si256,
  simd_broadcast_value = mm256_set1_epi32,
  simd_load_aligned = mm256_load_si256,
  simd_load_unaligned = mm256_loadu_si256,
  simd_store_unaligned = mm256_storeu_si256,
  simd_mul = mm256_mullo_epi32,
  simd_add = mm256_add_epi32,
  simd_fma = int32x8_muladd_unfused_avx2
)
```

This is a generic kernel config; I just need to swap the types and the SIMD intrinsics called. It allows me to reach 98%~101% of OpenBLAS on float32 with this config:

```nim
ukernel_generator(
  x86_AVX_FMA,
  typ = float32,
  vectype = m256,
  nb_scalars = 8,
  simd_setZero = mm256_setzero_ps,
  simd_broadcast_value = mm256_set1_ps,
  simd_load_aligned = mm256_load_ps,
  simd_load_unaligned = mm256_loadu_ps,
  simd_store_unaligned = mm256_storeu_ps,
  simd_mul = mm256_mul_ps,
  simd_add = mm256_add_ps,
  simd_fma = mm256_fmadd_ps
)

ukernel_generator(
  x86_AVX_FMA,
  typ = float64,
  vectype = m256d,
  nb_scalars = 4,
  simd_setZero = mm256_setzero_pd,
  simd_broadcast_value = mm256_set1_pd,
  simd_load_aligned = mm256_load_pd,
  simd_load_unaligned = mm256_loadu_pd,
  simd_store_unaligned = mm256_storeu_pd,
  simd_mul = mm256_mul_pd,
  simd_add = mm256_add_pd,
  simd_fma = mm256_fmadd_pd,
)
```

So I think integer SIMD is just slow (latency bottleneck?). For reference, due to the lack of mm256_mullo_epi64 in AVX2, I use a scalar code fallback for int64 and can surprisingly achieve 10 GINTOP/s on my machine.

```nim
type Int64x2 = array[2, int64]
func setZero_int64_sse2_fallback(): Int64x2 {.inline.} =
  discard

template set1_int64_sse2_fallback(a: int64): Int64x2 =
  [a, a]

func load_int64_sse2_fallback(mem_addr: ptr int64): Int64x2 {.inline.} =
  let p = cast[ptr UncheckedArray[int64]](mem_addr)
  [p[0], p[1]]

func store_int64_sse2_fallback(mem_addr: ptr int64, a: Int64x2) {.inline.} =
  let p = cast[ptr UncheckedArray[int64]](mem_addr)
  p[0] = a[0]
  p[1] = a[1]

template add_int64_sse2_fallback(a, b: Int64x2): Int64x2 =
  [a[0] + b[0], a[1] + b[1]]

template mul_int64_sse2_fallback(a, b: Int64x2): Int64x2 =
  [a[0] * b[0], a[1] * b[1]]

template fma_int64_sse2_fallback(a, b, c: Int64x2): Int64x2 =
  [c[0] + a[0]*b[0], c[1] + a[1]*b[1]]

ukernel_generator(
  x86_SSE2,
  typ = int64,
  vectype = Int64x2,
  nb_scalars = 2,
  simd_setZero = setZero_int64_sse2_fallback,
  simd_broadcast_value = set1_int64_sse2_fallback,
  simd_load_aligned = load_int64_sse2_fallback,
  simd_load_unaligned = load_int64_sse2_fallback,
  simd_store_unaligned = store_int64_sse2_fallback,
  simd_mul = mul_int64_sse2_fallback,
  simd_add = add_int64_sse2_fallback,
  simd_fma = fma_int64_sse2_fallback
)
```

Edit: Facebook FBGEMM also has int8 quantized matrix multiplication, but I think it's also for a sparse case: https://github.com/pytorch/FBGEMM
@mratsim, have you stumbled over Google's …? From what I understood, the main ingredient of their AVX2-based integer gemm is a more sophisticated packing routine. Instead of calculating the outer product between a column of the LHS matrix and a row of the RHS matrix, they perform an outer product + proper matrix multiplication at the same time. The reason for this is that AVX2 only gives you 16 packed vector registers, while the outer product between two packed sign-extended 16-bit integers (sign-extended from 8 to 16 bit) requires 16×16/8 registers (or 16×32/8 even when accumulating in 32 bits). You thus run into the issue of permanent register spilling to get the data in and out. Here is a schematic to make it clearer:
They pack their memory in such a way that on the LHS you have 3 panels of 8x2 matrices adjacent, while the RHS is packed such that you have panels of 2x4 matrices adjacent. You now perform the outer-product + matrix-multiplication between the LHS panels and the RHS 3 times. This allows us to accumulate the results in 3x4 = 12 registers in total, which leaves 4 registers to load the panels from the LHS and RHS.
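A scalar model of that register budget may make the accounting easier to follow (my own sketch, not gemmlowp code; `Reg` stands in for one 8-lane AVX2 register): the twelve accumulators stay resident while the three 8x2 LHS panels and the 2x4 RHS panel are streamed through the remaining registers.

```rust
/// Illustrative scalar model of the register budget described above.
/// One "register" holds 8 i32 lanes, as an AVX2 ymm register would.
type Reg = [i32; 8];

/// acc += a * b, lane by lane, with wrapping i32 arithmetic.
fn fma(acc: &mut Reg, a: &Reg, b: i32) {
    for i in 0..8 {
        acc[i] = acc[i].wrapping_add(a[i].wrapping_mul(b));
    }
}

/// C (24x4, held as 3x4 = 12 accumulator registers) +=
///     A (three adjacent 8x2 panels) * B (one 2x4 panel).
fn microkernel(
    acc: &mut [[Reg; 4]; 3],
    a_panels: &[[Reg; 2]; 3],
    b_panel: &[[i32; 4]; 2],
) {
    for p in 0..3 {           // the 3 LHS panels
        for k in 0..2 {       // the shared dimension of the 8x2 * 2x4 product
            for j in 0..4 {   // the 4 RHS columns
                // The 12 accumulators stay live; only a_panels[p][k] and the
                // broadcast scalar b_panel[k][j] need extra registers.
                fma(&mut acc[p][j], &a_panels[p][k], b_panel[k][j]);
            }
        }
    }
}

fn main() {
    let mut acc = [[[0i32; 8]; 4]; 3];
    // A filled with ones, B with twos: every output lane accumulates
    // 1*2 over k = 0..2, i.e. ends up as 4.
    let a = [[[1i32; 8]; 2]; 3];
    let b = [[2i32; 4]; 2];
    microkernel(&mut acc, &a, &b);
    assert!(acc.iter().flatten().flatten().all(|&x| x == 4));
}
```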
Hello!
Hi @ndattani. This effort stalled because I realized that I didn't actually need integer matrices for my implementation of Al-Mohy and Higham's … Also, reading some other integer gemm implementations out there, I realized that it's quite a bit more involved than the floating point version, because you need more complicated integer packing for the kernels to operate on. So I am sorry to disappoint you on this. I currently don't have any capacity to implement this. :(
int32 x int32 -> int32, i.e. without mixed-precision accumulation, is just a matter of replacing the float SIMD with integer SIMD, and is much easier than int16 x int16 -> int32 (or f16 x f16 -> f32 for that matter). The main issue is having a fallback, as int32 SIMD multiplication requires SSE4.1 and int64 SIMD multiplication requires AVX512, even when just using the SSE part of the code.
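A hedged sketch of what such a fallback could look like in Rust (my own illustration, not this crate's code): dispatch at runtime on SSE4.1 for `_mm_mullo_epi32` and fall back to scalar wrapping multiplies otherwise.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar fallback: element-wise wrapping i32 multiply.
fn mul_i32_scalar(a: &[i32; 4], b: &[i32; 4]) -> [i32; 4] {
    let mut out = [0i32; 4];
    for i in 0..4 {
        out[i] = a[i].wrapping_mul(b[i]);
    }
    out
}

/// SSE4.1 path: _mm_mullo_epi32 is the 32-bit low multiply that only
/// exists from SSE4.1 onwards, hence the runtime dispatch below.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn mul_i32_sse41(a: &[i32; 4], b: &[i32; 4]) -> [i32; 4] {
    let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
    let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
    let mut out = [0i32; 4];
    _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, _mm_mullo_epi32(va, vb));
    out
}

fn mul_i32(a: &[i32; 4], b: &[i32; 4]) -> [i32; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.1") {
            // Safe to call: we just checked the feature at runtime.
            return unsafe { mul_i32_sse41(a, b) };
        }
    }
    mul_i32_scalar(a, b)
}

fn main() {
    assert_eq!(mul_i32(&[1, 2, 3, 4], &[5, 6, 7, 8]), [5, 12, 21, 32]);
}
```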
cc @SuperFluffy #24
A place to discuss this