[WIP][DRAFT][onert] Optimized BatchMatMul in CPU backend #13919

Draft · wants to merge 1 commit into base: master
Conversation

jiwaszki (Contributor) commented on Sep 3, 2024

This commit introduces an improved BMM (BatchMatMul) kernel for the CPU backend.

ONE-DCO-1.0-Signed-off-by: Jan Iwaszkiewicz <[email protected]>
@jiwaszki added labels on Sep 3, 2024: PR/NO TEST (tell CI to not run tests), DRAFT (a draft issue or PR for sharing one's current working status and discussion), type/discussion (we need discussion; discussion itself can help, even without conclusions!)
jiwaszki (Contributor, Author) commented on Sep 3, 2024

For: #12140

This is a very rough draft that simply hooks into the current implementation.

@ragmani, as the creator of the issue, I want to ask for directions.
The current proposal is to reuse the already existing GEMM implementation, adjusted to work with BMM (meaning it runs once per batch); see the sketch below. Is that sufficient for a first level of optimization, or should I switch to a more custom implementation?
I also need to take a closer look at transposeRowsCols in BMM and its effect on the data layout seen by the kernel.
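To make the proposal concrete, here is a minimal self-contained sketch of the GEMM-per-batch idea. The naive gemm_slice is only a stand-in for cker's optimized GEMM (not its actual signature); the point is the outer batch loop:

```cpp
#include <cstddef>

// Naive stand-in for cker's optimized GEMM; the real code would call the
// existing optimized 2-D kernel here instead.
static void gemm_slice(const float *lhs, const float *rhs, float *out,
                       size_t m, size_t k, size_t n)
{
  for (size_t i = 0; i < m; ++i)
    for (size_t j = 0; j < n; ++j)
    {
      float acc = 0.f;
      for (size_t p = 0; p < k; ++p)
        acc += lhs[i * k + p] * rhs[p * n + j];
      out[i * n + j] = acc;
    }
}

// BMM as a loop of independent GEMMs: one [m,k] x [k,n] product per batch.
void batch_matmul(const float *lhs, const float *rhs, float *out,
                  size_t batches, size_t m, size_t k, size_t n)
{
  for (size_t b = 0; b < batches; ++b)
    gemm_slice(lhs + b * m * k, rhs + b * k * n, out + b * m * n, m, k, n);
}
```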

I see that optimized::GEMM is used together with CKER_X86_PLATFORM. Does BMM also need two branches, X86+GEMM and others+GEMV (i.e. reusing something like MatrixBatchVectorMultiplyAccumulate)?
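Concretely, I imagine the dispatch looking roughly like this (gemm_slice and gemv_based_slice are hypothetical placeholders, not actual cker entry points; only the CKER_X86_PLATFORM gating mirrors how optimized::Gemm is selected today):

```cpp
#include <cstddef>

// Hypothetical per-slice kernels, standing in for cker's optimized routines.
void gemm_slice(const float *lhs, const float *rhs, float *out, size_t m, size_t k, size_t n);
void gemv_based_slice(const float *lhs, const float *rhs, float *out, size_t m, size_t k, size_t n);

void batch_matmul_dispatch(const float *lhs, const float *rhs, float *out,
                           size_t batches, size_t m, size_t k, size_t n)
{
  for (size_t b = 0; b < batches; ++b)
  {
    const float *a = lhs + b * m * k;
    const float *x = rhs + b * k * n;
    float *y = out + b * m * n;
#ifdef CKER_X86_PLATFORM
    gemm_slice(a, x, y, m, k, n); // x86: optimized GEMM per batch slice
#else
    gemv_based_slice(a, x, y, m, k, n); // other archs: GEMV-style fallback
#endif
  }
}
```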

About shapes: I have only tested on perfectly square matrices so far; I still need to investigate other shapes and the possible limitations of cker's GEMM.

As a note, here are the average times over 1000 benchmark runs:

Current       Proposal
~0.02820 s    ~0.00769 s

PS @zetwhite, I saw your comment #12140 (comment), but it's a few months old. I am sorry that I took this task over without notice. Are you okay with me continuing it?

zetwhite (Contributor) commented on Sep 3, 2024

PS @zetwhite, I saw your comment #12140 (comment), but it's a few months old. I am sorry that I took this task over without notice. Are you okay with me continuing it?

No problem with me :) Please self-assign #12140.

ragmani (Contributor) commented on Sep 4, 2024

@jiwaszki

The current proposal is to reuse the already existing GEMM implementation, adjusted to work with BMM (meaning it runs once per batch). Is that sufficient for a first level of optimization, or should I switch to a more custom implementation?

I think it's sufficient for the first level of optimization. But it's more important to optimize for Arm than for x64.

I also need to take a closer look at transposeRowsCols in BMM and its effect on the data layout seen by the kernel.

AFAIK, circle and tflite models have the attributes adjoint_lhs and adjoint_rhs for the BMM op. BMM's lhs and rhs layouts differ depending on those attributes. We may need to implement optimized kernels that run according to those attributes; see the sketch after the example below.

Here is an example case:
[two attached images illustrating the lhs/rhs layouts were here]
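To pin down the layout question, a plain reference sketch of how I read those two attributes (assuming each flag simply marks its operand as stored pre-transposed; this is an illustration, not the actual onert kernel):

```cpp
#include <cstddef>

// Read element (r, c) of a logical [rows, cols] operand; with adj set, the
// operand is stored transposed as [cols, rows], so swap the indexing.
inline float at(const float *m, size_t rows, size_t cols, size_t r, size_t c, bool adj)
{
  return adj ? m[c * rows + r] : m[r * cols + c];
}

// Reference BMM honoring the adjoint flags: out[b] = lhs[b] x rhs[b], with
// lhs logically [m,k] and rhs logically [k,n], each possibly stored transposed.
void batch_matmul_ref(const float *lhs, const float *rhs, float *out,
                      size_t batches, size_t m, size_t k, size_t n,
                      bool adjoint_lhs, bool adjoint_rhs)
{
  for (size_t b = 0; b < batches; ++b)
    for (size_t i = 0; i < m; ++i)
      for (size_t j = 0; j < n; ++j)
      {
        float acc = 0.f;
        for (size_t p = 0; p < k; ++p)
          acc += at(lhs + b * m * k, m, k, i, p, adjoint_lhs) *
                 at(rhs + b * k * n, k, n, p, j, adjoint_rhs);
        out[b * m * n + i * n + j] = acc;
      }
}
```

An optimized kernel would instead pick a specialized inner loop per (adjoint_lhs, adjoint_rhs) combination, since the transposed reads above have poor cache locality.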

I see that optimized::GEMM is used together with CKER_X86_PLATFORM. Does BMM also need two branches, X86+GEMM and others+GEMV (i.e. reusing something like MatrixBatchVectorMultiplyAccumulate)?

Yes, we also need optimized kernels running on the Arm architecture. I think GEMV is sufficient for the first level of optimization on Arm.
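For reference, here is a portable sketch of the TFLite-style MatrixBatchVectorMultiplyAccumulate semantics as I understand them (the real cker routine has architecture-specific optimized paths; treat this as an illustration, not its actual implementation):

```cpp
// result[b][r] += dot(matrix[r], vectors[b]), with matrix [rows, cols] and
// n_batch input vectors of length cols; result is [n_batch, rows].
void MatrixBatchVectorMultiplyAccumulate(const float *matrix, int rows, int cols,
                                         const float *vectors, int n_batch, float *result)
{
  for (int b = 0; b < n_batch; ++b)
    for (int r = 0; r < rows; ++r)
    {
      float acc = 0.f;
      for (int c = 0; c < cols; ++c)
        acc += matrix[r * cols + c] * vectors[b * cols + c];
      result[b * rows + r] += acc;
    }
}
```

Note that mapping a BMM slice onto these semantics treats the rows of lhs as the batch of vectors and needs rhs in [n, k] layout, which ties back to the transposeRowsCols / adjoint discussion above.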
