backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels #9921
base: master
Conversation
Force-pushed from e7974bb to c9c1afb
I think this is a step in the right direction, but I am not convinced about the current implementation. Generally, changes in tensor layout are intended to be implemented through the ggml-backend buffer interface: it gives the application more control over which tensors will be changed, it allows changes to the tensor size, and the conversion would be done at load time. Doing it this way may cause some tensors to be unintentionally converted, such as a quantized KV cache. However, the llama.cpp model loader does not currently have a good way to support this, but I am working on that. Note that there are also AVX implementations for these types:
```diff
diff --git a/ggml/src/ggml-aarch64.c b/ggml/src/ggml-aarch64.c
index 700e66a0..4060d78e 100644
--- a/ggml/src/ggml-aarch64.c
+++ b/ggml/src/ggml-aarch64.c
@@ -3305,4 +3305,12 @@ void ggml_prepare_optimal_kernel(struct ggml_tensor *cur, uint8_t **pmem, size_t
         }
     }
 #endif
+
+#if defined(__AVX2__) || defined(__AVX512F__)
+    if (cur->type == GGML_TYPE_Q4_0) {
+        if (repack_q4_0_to_q4_0_8_bl(cur, 8, pmem, psize) == 0) {
+            cur->type = GGML_TYPE_Q4_0_8_8;
+        }
+    }
+#endif
 }
```
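To make the buffer-interface direction more concrete: the conversion would live in a dedicated buffer type, so only tensors the application explicitly places there get repacked, and the work happens while the weights are uploaded at load time. Below is a minimal sketch of such a set_tensor-style hook; the repack helper and the exact callback signature are assumptions for illustration, not the actual ggml-backend API.

```c
// Minimal sketch (not the real ggml-backend interface): a set_tensor-style
// hook for a dedicated "repacking" buffer type. Only tensors placed in this
// buffer type are converted, which avoids accidentally repacking things like
// a quantized KV cache.
#include <string.h>
#include "ggml.h"

// hypothetical helper: repack `size` bytes of Q4_0 blocks from `src` into the
// 8-block interleaved layout used by the optimized GEMM/GEMV kernels
extern void repack_q4_0_to_q4_0_8_8(const void * src, void * dst, size_t size);

static void repacking_buffer_set_tensor(struct ggml_tensor * tensor,
                                        const void * data, size_t offset, size_t size) {
    if (tensor->type == GGML_TYPE_Q4_0 && offset == 0 && size == ggml_nbytes(tensor)) {
        // convert while copying into the buffer, so the graph only ever sees
        // a tensor that is already in the optimal layout for this CPU
        repack_q4_0_to_q4_0_8_8(data, tensor->data, size);
        tensor->type = GGML_TYPE_Q4_0_8_8;
    } else {
        // everything else is stored unchanged
        memcpy((char *) tensor->data + offset, data, size);
    }
}
```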
@slaren Thank you for the review and valuable feedback! I understand the direction you're suggesting, particularly with aligning with the ggml-backend buffer interface. Could you provide more details on this approach? Specifically, I'm curious how it would integrate, given that the mulmat tensor is currently constructed during the graph build, which occurs after the model loader. I also wanted to ask about the timeline for the llama.cpp model loader improvements that would support this. If those changes aren't expected to be completed soon, I suggest we merge the current PR with the necessary updates to ensure functionality in the short term. In parallel, we will start working on a more aligned implementation that integrates with the ggml-backend buffer interface. Please let me know your thoughts on this.
I am working on the llama model loader at the moment. One of the changes that I will make is that it will be able to choose the buffer type used to offload each tensor depending on the operations in which it will be used. This is mainly to prevent offloading tensors with types that are not supported by a backend, but it will also be useful for implementing this. It shouldn't take too long before this is merged. I think that this approach is too error-prone to merge as it is. There are at least two cases that I am aware of that will not work:
Both of these will crash with this PR. It may be possible to fix these issues specifically, but fundamentally the problem is that modifying the tensors in the backend breaks the assumptions that applications make about the way ggml uses them. It would be a constant source of problems, and it would be hard for other ggml applications to take advantage of this. In the meantime, llama.cpp users can already get the performance benefit of these types by converting the model beforehand.
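As a rough illustration of the loader change described above, buffer-type selection per weight could be driven by an op-support query, along these lines. The capability query and candidate list are hypothetical stand-ins, not the actual llama.cpp loader code.

```c
// Sketch only: pick a buffer type for a weight based on the op it feeds.
// `supports_op_for_buft` is a hypothetical stand-in for a backend capability
// query; the real loader API may look different.
#include "ggml.h"
#include "ggml-backend.h"

extern bool supports_op_for_buft(ggml_backend_buffer_type_t buft, const struct ggml_tensor * op);

static ggml_backend_buffer_type_t select_weight_buft(struct ggml_context * ctx,
                                                     struct ggml_tensor * weight,
                                                     ggml_backend_buffer_type_t * candidates,
                                                     int n_candidates) {
    // build a representative op for this weight, e.g. the mat-mul it will be used in
    struct ggml_tensor * input = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, weight->ne[0], 512);
    struct ggml_tensor * op    = ggml_mul_mat(ctx, weight, input);

    for (int i = 0; i < n_candidates; i++) {
        // first candidate whose backend can run the op with this weight wins
        if (supports_op_for_buft(candidates[i], op)) {
            return candidates[i];
        }
    }
    return NULL; // caller falls back to a default buffer type
}
```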
@slaren Thank you for your detailed feedback. I'll hold off on this PR and wait for your patch that allows the model loader to choose buffer types based on tensor operations. Once that is in place, I'll refactor my implementation accordingly.
Added an online flow to the CPU backend that requantizes and repacks Q4_0 weights at runtime so the optimized aarch64 GEMM and GEMV kernels can be used. The feature is enabled with the runtime option -rtrp (--runtime-repack).
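For context on what the flag does internally: after the weights are loaded, the CPU backend walks the model tensors and repacks eligible Q4_0 tensors via ggml_prepare_optimal_kernel (the entry point visible in the diff in the review discussion above). A simplified sketch of that pass follows; the iteration over a weight context and the handling of the repack buffer are assumptions for illustration, not the exact PR code.

```c
// Sketch only: the conceptual shape of the online repack pass enabled by
// --runtime-repack. ggml_prepare_optimal_kernel is the entry point shown in
// the diff above; the surrounding loop and buffer handling are illustrative.
#include <stdint.h>
#include <stddef.h>
#include "ggml.h"

void ggml_prepare_optimal_kernel(struct ggml_tensor * cur, uint8_t ** pmem, size_t * psize);

static void runtime_repack_weights(struct ggml_context * ctx_weights) {
    uint8_t * mem  = NULL; // buffer for repacked data, managed by the helper (assumed)
    size_t    size = 0;

    for (struct ggml_tensor * t = ggml_get_first_tensor(ctx_weights); t != NULL;
         t = ggml_get_next_tensor(ctx_weights, t)) {
        // the helper only converts supported types (e.g. Q4_0 -> Q4_0_8_8)
        // and updates t->type on success, as in the diff above
        ggml_prepare_optimal_kernel(t, &mem, &size);
    }
}
```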
Example of using the runtime option with llama-bench on Graviton 3 (-rtrp 1,0 runs each test with repacking enabled and disabled):
$ ./llama-bench -m phi-2.Q4_0.gguf -t 4 -rtrp 1,0
| model | size | params | backend | threads | repack | test | t/s |
| ------------------------- | ---------- | -------- | ------- | ------- | ------ | ------- | ------------- |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | CPU | 4 | 1 | pp512 | 110.84 ± 0.01 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | CPU | 4 | 1 | tg128 | 39.42 ± 0.02 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | CPU | 4 | 0 | pp512 | 38.03 ± 0.01 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | CPU | 4 | 0 | tg128 | 16.95 ± 0.01 |