
backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels #9921

Open · wants to merge 1 commit into master

Conversation

@chaxu01 (Contributor) commented Oct 17, 2024

Added a CPU backend online flow that allows runtime requantization and repacking of Q4_0 tensors to enable the optimized GEMM and GEMV kernels. The feature is enabled with the runtime option -rtrp (--runtime-repack).

Example of using the runtime option for a benchmark on Graviton 3:

$./llama-bench -m phi-2.Q4_0.gguf -t 4 -rtrp 1,0
| model | size | params | backend | threads | repack | test | t/s |
|---------------|----------|---------|---------|---------|--------|--------|---------------|
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | CPU | 4 | 1 | pp512 | 110.84 ± 0.01 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | CPU | 4 | 1 | tg128 | 39.42 ± 0.02 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | CPU | 4 | 0 | pp512 | 38.03 ± 0.01 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | CPU | 4 | 0 | tg128 | 16.95 ± 0.01 |
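
For context, the sketch below illustrates the block interleaving that the repack step performs. It is a simplified, illustrative version only (struct and function names here are not the ones in this PR, and the real kernels additionally reorder the nibbles inside each block):

```c
// Simplified sketch of 4-row Q4_0 block interleaving (illustrative only; the
// real repack routines also permute nibbles inside each block for the SIMD kernels).
#include <stdint.h>
#include <string.h>

#define QK4_0 32                      // weights per Q4_0 block

typedef uint16_t ggml_half;           // fp16 scale stored as raw bits here

typedef struct {
    ggml_half d;                      // per-block scale
    uint8_t   qs[QK4_0 / 2];          // 32 x 4-bit quants, packed two per byte
} block_q4_0;

typedef struct {
    ggml_half d[4];                   // scales of the 4 interleaved source blocks
    uint8_t   qs[4 * QK4_0 / 2];      // their quants, laid out back to back
} block_q4_0x4;

// Interleave 4 rows of n_blocks Q4_0 blocks each into one row of Q4_0x4 blocks,
// so the GEMM/GEMV kernels can load contiguous data for four output rows at once.
static void repack_q4_0_rows_x4(const block_q4_0 *rows[4], block_q4_0x4 *dst, int n_blocks) {
    for (int b = 0; b < n_blocks; ++b) {
        for (int r = 0; r < 4; ++r) {
            dst[b].d[r] = rows[r][b].d;
            memcpy(dst[b].qs + r * (QK4_0 / 2), rows[r][b].qs, QK4_0 / 2);
        }
    }
}
```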

github-actions bot added the examples and ggml (changes relating to the ggml tensor library for machine learning) labels Oct 17, 2024
@slaren (Collaborator) commented Oct 17, 2024

I think this is a step in the right direction, but I am not convinced by the current implementation. Generally, changes in tensor layout are intended to be implemented through the ggml-backend buffer interface: it gives the application more control over which tensors will be changed, it allows changes to the tensor size, and the conversion is done at load time. Converting in the backend at runtime, as this PR does, may also cause some tensors to be converted unintentionally, such as a quantized KV cache. The llama.cpp model loader does not currently have a good way to support the buffer-interface approach, but I am working on that.
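
To make the suggested direction more concrete, here is a rough sketch of the idea, with illustrative types and names rather than the real ggml-backend API: a dedicated buffer type converts the layout while the loader writes the tensor data, so the application opts in per tensor and nothing is rewritten behind its back at graph-execution time.

```c
// Rough sketch of the repacking-buffer idea (illustrative names, not the real
// ggml-backend interface): the layout conversion happens when the model loader
// uploads tensor data into the buffer, chosen per tensor by the application.
#include <stddef.h>
#include <string.h>

struct tensor { int type; char *data; };           // stand-in for ggml_tensor

// Hypothetical repack helper, e.g. Q4_0 -> Q4_0_8_8 (declaration only).
void repack_q4_0_into(struct tensor *t, const void *src, size_t offset, size_t size);

typedef struct {
    // The loader calls this to upload tensor data into backend memory.
    void (*set_tensor)(struct tensor *t, const void *data, size_t offset, size_t size);
} buffer_iface;

static void plain_set_tensor(struct tensor *t, const void *data, size_t offset, size_t size) {
    memcpy(t->data + offset, data, size);          // default buffer: plain byte copy
}

static void repack_set_tensor(struct tensor *t, const void *data, size_t offset, size_t size) {
    repack_q4_0_into(t, data, offset, size);       // convert the layout while copying in
}

// The loader would pick the repacking buffer only for weights that feed matmuls,
// leaving the KV cache and shared embedding/output tensors untouched.
static const buffer_iface plain_buffer  = { plain_set_tensor  };
static const buffer_iface repack_buffer = { repack_set_tensor };
```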

Note that there are AVX implementations for Q4_0_8_8 gemm, so with a small change this can also benefit x86 processors (tested on 13900k):

| model | size | params | backend | threads | repack | test | t/s |
|---------------|----------|--------|---------|---------|--------|-------|--------------|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | 0 | pp512 | 50.56 ± 0.47 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | 0 | tg128 | 20.79 ± 0.03 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | 1 | pp512 | 64.79 ± 0.32 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | 1 | tg128 | 15.80 ± 0.05 |
patch
diff --git a/ggml/src/ggml-aarch64.c b/ggml/src/ggml-aarch64.c
index 700e66a0..4060d78e 100644
--- a/ggml/src/ggml-aarch64.c
+++ b/ggml/src/ggml-aarch64.c
@@ -3305,4 +3305,12 @@ void ggml_prepare_optimal_kernel(struct ggml_tensor *cur, uint8_t **pmem, size_t
         }
     }
 #endif
+
+#if defined(__AVX2__) || defined(__AVX512F__)
+    if (cur->type == GGML_TYPE_Q4_0) {
+        if (repack_q4_0_to_q4_0_8_bl(cur, 8, pmem, psize) == 0) {
+            cur->type = GGML_TYPE_Q4_0_8_8;
+        }
+    }
+#endif
 }

@chaxu01 (Contributor, Author) commented Oct 18, 2024

@slaren Thank you for the review and valuable feedback!

I understand the direction you're suggesting, particularly aligning with the ggml-backend buffer interface. Could you provide more details on this approach? Specifically, I'm curious how it would integrate, given that the mul_mat tensor is currently constructed during graph build, which happens after the model loader has run.

I also wanted to ask about the timeline for the llama.cpp model loader improvements that would support this. If those changes aren’t expected to be completed soon, I suggest we merge the current PR with the necessary updates to ensure functionality in the short term. In parallel, we will start working on a more aligned implementation that integrates with the ggml-backend buffer interface.

Please let me know your thoughts on this.

@slaren (Collaborator) commented Oct 18, 2024

I am working on the llama model loader at the moment. One of the changes I will make is that it will be able to choose the buffer type to which each tensor is offloaded, depending on the operations in which the tensor is used. This is mainly to prevent offloading tensors whose types are not supported by a backend, but it will also be useful for implementing this. It shouldn't take too long for this to be merged.
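
As a rough illustration of that selection step (hypothetical names only, not the actual loader code), deciding the buffer type per tensor could come down to inspecting the operations each weight participates in:

```c
// Hypothetical sketch of per-tensor buffer-type selection: a tensor only goes
// into the repacking buffer if it is a Q4_0 weight used exclusively as the
// left operand of matrix multiplications.
#include <stdbool.h>

enum tensor_type { TYPE_Q4_0, TYPE_OTHER };
enum op          { OP_MUL_MAT, OP_GET_ROWS, OP_OTHER };

struct weight_use { enum op op; int operand_index; };

static bool should_repack(enum tensor_type type, const struct weight_use *uses, int n_uses) {
    if (type != TYPE_Q4_0) {
        return false;
    }
    for (int i = 0; i < n_uses; ++i) {
        // A tensor that is also read by OP_GET_ROWS (shared token embedding /
        // output weights) or used outside matmul keeps its original layout.
        if (uses[i].op != OP_MUL_MAT || uses[i].operand_index != 0) {
            return false;
        }
    }
    return n_uses > 0;
}
```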

I think that this approach is too error prone to merge as it is. There are at least two cases that I am aware of that will not work:

  • Models with shared token embedding and output tensors. For example, try gemma-2-2b-it quantized to Q4_0 with --pure.
  • KV quantization with -ctk q4_0

Both of these will crash with this PR. It may be possible to fix these issues specifically, but the fundamental problem is that modifying the tensors in the backend breaks the assumptions that applications make about the way ggml uses them. It would be a constant source of problems, and it would be hard for other ggml applications to take advantage of this.

In the meantime, llama.cpp users can already get the performance boost of these types by converting the model beforehand with llama-quantize.
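
For example, assuming the Q4_0_8_8 type name listed by llama-quantize at the time of writing (the exact argument names can be checked in the tool's usage output), an existing Q4_0 model can be converted offline with something like:

$./llama-quantize --allow-requantize phi-2.Q4_0.gguf phi-2.Q4_0_8_8.gguf Q4_0_8_8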

@chaxu01 (Contributor, Author) commented Oct 21, 2024

@slaren Thank you for your detailed feedback. I'll hold off on this PR and wait for your patch that allows the model loader to choose buffer types based on tensor operations. Once that is in place, I'll refactor my implementation accordingly.
