
backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels #9921

Open · wants to merge 1 commit into master

Conversation

@chaxu01 (Contributor) commented Oct 17, 2024

Added a CPU backend online flow that allows runtime requantization and repacking of Q4_0 tensors to enable the optimized GEMM and GEMV kernels. The feature is enabled with the runtime option -rtrp (--runtime-repack).

Example of using the runtime option for a benchmark on Graviton 3:

$./llama-bench -m phi-2.Q4_0.gguf -t 4 -rtrp 1,0
| model | size | params | backend | threads | repack | test | t/s |
|---------------|----------|---------|---------|---------|--------|--------|---------------|
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | CPU | 4 | 1 | pp512 | 110.84 ± 0.01 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | CPU | 4 | 1 | tg128 | 39.42 ± 0.02 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | CPU | 4 | 0 | pp512 | 38.03 ± 0.01 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | CPU | 4 | 0 | tg128 | 16.95 ± 0.01 |
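
For context, the sketch below illustrates the block interleaving that the repack step performs. It is a simplified, illustrative version only (struct and function names here are not the ones in this PR, and the real kernels additionally reorder the nibbles inside each block):

```c
// Simplified sketch of 4-row Q4_0 block interleaving (illustrative only; the
// real repack routines also permute nibbles inside each block for the SIMD kernels).
#include <stdint.h>
#include <string.h>

#define QK4_0 32                      // weights per Q4_0 block

typedef uint16_t ggml_half;           // fp16 scale stored as raw bits here

typedef struct {
    ggml_half d;                      // per-block scale
    uint8_t   qs[QK4_0 / 2];          // 32 x 4-bit quants, packed two per byte
} block_q4_0;

typedef struct {
    ggml_half d[4];                   // scales of the 4 interleaved source blocks
    uint8_t   qs[4 * QK4_0 / 2];      // their quants, laid out back to back
} block_q4_0x4;

// Interleave 4 rows of n_blocks Q4_0 blocks each into one row of Q4_0x4 blocks,
// so the GEMM/GEMV kernels can load contiguous data for four output rows at once.
static void repack_q4_0_rows_x4(const block_q4_0 *rows[4], block_q4_0x4 *dst, int n_blocks) {
    for (int b = 0; b < n_blocks; ++b) {
        for (int r = 0; r < 4; ++r) {
            dst[b].d[r] = rows[r][b].d;
            memcpy(dst[b].qs + r * (QK4_0 / 2), rows[r][b].qs, QK4_0 / 2);
        }
    }
}
```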

github-actions bot added the examples and ggml (changes relating to the ggml tensor library for machine learning) labels Oct 17, 2024
@slaren (Collaborator) commented Oct 17, 2024

I think this is a step in the right direction, but I am not convinced by the current implementation. Generally, changes in tensor layout are intended to be implemented through the ggml-backend buffer interface: it gives the application more control over which tensors will be changed, it allows changes to the tensor size, and the conversion is done at load time. Converting in the backend at runtime, as this PR does, may also cause some tensors to be converted unintentionally, such as a quantized KV cache. The llama.cpp model loader does not currently have a good way to support the buffer-interface approach, but I am working on that.
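
To make the suggested direction more concrete, here is a rough sketch of the idea, with illustrative types and names rather than the real ggml-backend API: a dedicated buffer type converts the layout while the loader writes the tensor data, so the application opts in per tensor and nothing is rewritten behind its back at graph-execution time.

```c
// Rough sketch of the repacking-buffer idea (illustrative names, not the real
// ggml-backend interface): the layout conversion happens when the model loader
// uploads tensor data into the buffer, chosen per tensor by the application.
#include <stddef.h>
#include <string.h>

struct tensor { int type; char *data; };           // stand-in for ggml_tensor

// Hypothetical repack helper, e.g. Q4_0 -> Q4_0_8_8 (declaration only).
void repack_q4_0_into(struct tensor *t, const void *src, size_t offset, size_t size);

typedef struct {
    // The loader calls this to upload tensor data into backend memory.
    void (*set_tensor)(struct tensor *t, const void *data, size_t offset, size_t size);
} buffer_iface;

static void plain_set_tensor(struct tensor *t, const void *data, size_t offset, size_t size) {
    memcpy(t->data + offset, data, size);          // default buffer: plain byte copy
}

static void repack_set_tensor(struct tensor *t, const void *data, size_t offset, size_t size) {
    repack_q4_0_into(t, data, offset, size);       // convert the layout while copying in
}

// The loader would pick the repacking buffer only for weights that feed matmuls,
// leaving the KV cache and shared embedding/output tensors untouched.
static const buffer_iface plain_buffer  = { plain_set_tensor  };
static const buffer_iface repack_buffer = { repack_set_tensor };
```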

Note that there are AVX implementations for Q4_0_8_8 gemm, so with a small change this can also benefit x86 processors (tested on 13900k):

| model | size | params | backend | threads | repack | test | t/s |
|---------------|----------|--------|---------|---------|--------|-------|--------------|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | 0 | pp512 | 50.56 ± 0.47 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | 0 | tg128 | 20.79 ± 0.03 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | 1 | pp512 | 64.79 ± 0.32 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | 1 | tg128 | 15.80 ± 0.05 |
patch
diff --git a/ggml/src/ggml-aarch64.c b/ggml/src/ggml-aarch64.c
index 700e66a0..4060d78e 100644
--- a/ggml/src/ggml-aarch64.c
+++ b/ggml/src/ggml-aarch64.c
@@ -3305,4 +3305,12 @@ void ggml_prepare_optimal_kernel(struct ggml_tensor *cur, uint8_t **pmem, size_t
         }
     }
 #endif
+
+#if defined(__AVX2__) || defined(__AVX512F__)
+    if (cur->type == GGML_TYPE_Q4_0) {
+        if (repack_q4_0_to_q4_0_8_bl(cur, 8, pmem, psize) == 0) {
+            cur->type = GGML_TYPE_Q4_0_8_8;
+        }
+    }
+#endif
 }

@chaxu01 (Contributor, Author) commented Oct 18, 2024

@slaren Thank you for the review and valuable feedback!

I understand the direction you're suggesting, particularly aligning with the ggml-backend buffer interface. Could you provide more details on this approach? Specifically, I'm curious how it would integrate, given that the mul_mat tensor is currently constructed during graph build, which happens after the model loader has run.

I also wanted to ask about the timeline for the llama.cpp model loader improvements that would support this. If those changes aren’t expected to be completed soon, I suggest we merge the current PR with the necessary updates to ensure functionality in the short term. In parallel, we will start working on a more aligned implementation that integrates with the ggml-backend buffer interface.

Please let me know your thoughts on this.

@slaren (Collaborator) commented Oct 18, 2024

I am working on the llama model loader at the moment. One of the changes I will make is that it will be able to choose the buffer type to which each tensor is offloaded, depending on the operations in which the tensor is used. This is mainly to prevent offloading tensors whose types are not supported by a backend, but it will also be useful for implementing this. It shouldn't take too long for this to be merged.
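
As a rough illustration of that selection step (hypothetical names only, not the actual loader code), deciding the buffer type per tensor could come down to inspecting the operations each weight participates in:

```c
// Hypothetical sketch of per-tensor buffer-type selection: a tensor only goes
// into the repacking buffer if it is a Q4_0 weight used exclusively as the
// left operand of matrix multiplications.
#include <stdbool.h>

enum tensor_type { TYPE_Q4_0, TYPE_OTHER };
enum op          { OP_MUL_MAT, OP_GET_ROWS, OP_OTHER };

struct weight_use { enum op op; int operand_index; };

static bool should_repack(enum tensor_type type, const struct weight_use *uses, int n_uses) {
    if (type != TYPE_Q4_0) {
        return false;
    }
    for (int i = 0; i < n_uses; ++i) {
        // A tensor that is also read by OP_GET_ROWS (shared token embedding /
        // output weights) or used outside matmul keeps its original layout.
        if (uses[i].op != OP_MUL_MAT || uses[i].operand_index != 0) {
            return false;
        }
    }
    return n_uses > 0;
}
```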

I think that this approach is too error prone to merge as it is. There are at least two cases that I am aware of that will not work:

  • Models with shared token embedding and output tensors. For example, try gemma-2-2b-it quantized to Q4_0 with --pure.
  • KV quantization with -ctk q4_0

Both of these will crash with this PR. It may be possible to fix these issues specifically, but the fundamental problem is that modifying the tensors in the backend breaks the assumptions that applications make about the way ggml uses them. It would be a constant source of problems, and it would be hard for other ggml applications to take advantage of this.

In the meantime, llama.cpp users can already get the performance boost of these types by converting the model beforehand with llama-quantize.
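
For example, assuming the Q4_0_8_8 type name listed by llama-quantize at the time of writing (the exact argument names can be checked in the tool's usage output), an existing Q4_0 model can be converted offline with something like:

$./llama-quantize --allow-requantize phi-2.Q4_0.gguf phi-2.Q4_0_8_8.gguf Q4_0_8_8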

@chaxu01 (Contributor, Author) commented Oct 21, 2024

@slaren Thank you for your detailed feedback. I'll hold off on this PR and wait for your patch that allows the model loader to choose buffer types based on tensor operations. Once that is in place, I'll refactor my implementation accordingly.
