Group size and restrictions: documentation and implementation contradict each other #124
Comments
The readme is correct: you are not seeing a reduction in memory because your group-size is too small; you need to use a larger one. Ideally, your group-size should be a power of 2 (32, 64, 128, ...), or you can quantize channel-wise.

Why would you like to quantize a 1B model? The model is already small; quantization is most useful for larger models.
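(For reference, a minimal sketch of picking a power-of-2 group size with hqq's `BaseQuantizeConfig`; the `nbits=4` setting here is just an example, not something prescribed in this thread.)

```python
from hqq.core.quantize import BaseQuantizeConfig

# Power-of-2 group sizes (32, 64, 128, ...) keep the weight shapes friendly
# to the fast backends; nbits=4 is only an example setting.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
```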
Then the assert is incorrect? Because if "no restrictions as long as weight.numel() is divisible by the group_size" is correct, I should be able to use group size 38 for [8512, 2048]: 17432576 / 38 = 458752.
It's a proof of concept to see which parts can be quantized, making the model smaller without degrading it too much.
Using a group-size of 38 is an edge case which is incompatible with the fast backends. As a rule of thumb, all matrix/vector shapes should ideally be powers of 2, so try to choose from 32, 64, 128, etc. I can change the readme to only include powers of 2 / the out_features size, and mention the fast backends, to avoid any confusion 👍

A 7B model like Llama2-7B should take ~5GB of VRAM and run at ~170 tokens/sec on a 4090. Would you mind sharing your code to quantize and run? I can assist you to run it properly. Start with this example: https://github.com/mobiusml/hqq/blob/master/examples/backends/hqq_lib_demo.py
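(For anyone following along, a rough sketch of the flow the linked demo follows: load the model with transformers, then quantize it with hqq. The `AutoHQQHFModel.quantize_model` entry point and the model id are my assumptions and may differ between hqq versions, so treat the linked hqq_lib_demo.py as the reference.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel  # entry point assumed from recent hqq versions

model_id = "meta-llama/Llama-2-7b-hf"  # example model; any causal LM should work
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit with group_size=64 is the kind of setting expected to fit a 7B model in ~5GB of VRAM.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device="cuda")
```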
Yeah, that would help avoid confusion
It actually would be helpful. In general, non-transformers documentation would be useful, as that is where HQQ can shine the most: where exllama and gguf can't be used easily, if at all, with HQQ it's possible to just replace the nn.Linear layers after loading the model, without fine-tuning it.

I'm targeting Zamba2 adapted to HF transformers instead of their fork. The Zamba2 7B version has 81 Mamba2 layers, and each Mamba2 block has two linears that make up the majority of the weights.

Also, the culprit of the slowness was found: most backends (aten, torch, torch_compile). I didn't test the pytorch/pytorch_compile backends on 7B, but on 1.2B they were already too slow, to the point that if, in each Mamba block, I quantized just in_proj instead of both in_proj and out_proj, the speed doubled. Also I didn't see
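(To illustrate the "replace nn.Linears after loading" approach mentioned above, a rough sketch. The in_proj/out_proj attribute names are taken from the Mamba2 description in this thread and are assumptions about the actual Zamba2 module layout.)

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Collect targets first, then replace, so the module tree isn't mutated while iterating.
targets = []
for name, module in model.named_modules():
    for child_name in ("in_proj", "out_proj"):  # assumed Mamba2 projection names
        child = getattr(module, child_name, None)
        if isinstance(child, torch.nn.Linear):
            targets.append((module, child_name, child))

for parent, child_name, linear in targets:
    quantized = HQQLinear(linear, quant_config=quant_config,
                          compute_dtype=torch.float16, device="cuda", del_orig=True)
    setattr(parent, child_name, quantized)
```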
What you want is the backend for faster inference, which is enabled via ... As you said:

model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

You'll have to run it for a couple of iterations to warm up. I saw there are some people trying to refactor Mamba for torch.compile, so maybe that work would be useful.
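(A minimal sketch of the warm-up step, assuming `dummy_input` is a batch shaped like the real inputs: the first few calls pay the compilation and CUDA-graph-capture cost of reduce-overhead mode, so only measure speed afterwards.)

```python
import torch

model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

# Warm-up: the first iterations trigger compilation / graph capture.
with torch.inference_mode():
    for _ in range(5):
        model(dummy_input)  # dummy_input is a placeholder for a real batch
```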
According to the readme:

group_size (int): no restrictions as long as weight.numel() is divisible by the group_size

However, in the code group_size must be divisible by 8 (there is an assert).

Zamba2 1.2B has torch.Size([8512, 2048]) linears. With the readme in mind, I intentionally tried to use group size = 38 (64 was too imprecise, 32 didn't decrease memory enough). So the readme probably should state this divisibility restriction, or the assert should be removed from the code.
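(The contradiction is easy to check numerically; a tiny sketch with the shape from this issue, where the divisible-by-8 condition stands in for the assert described above.)

```python
# Zamba2-1.2B linear weight shape from the issue: [8512, 2048]
numel = 8512 * 2048        # 17_432_576
group_size = 38

print(numel % group_size)  # 0 -> allowed by the readme wording
print(group_size % 8)      # 6 -> rejected by the assert in the code
```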