I had a couple of questions related to exactly how the quantization is done:
Does bitsandbytes do fp16 -> int8 quantization after transferring the tensors to the GPU? And if you want to dequantize, are those operations done on the GPU as well?
Is the quantization method absmax or zero-point, and is it applied row-wise? There is some mention of column-wise features, but when I load quantized models with Hugging Face, the scale factors seem to differ per row, not per column.
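For concreteness, by row-wise absmax I mean something like the following NumPy sketch (my own simplified illustration, not the actual bitsandbytes kernel):

```python
import numpy as np

def quantize_rowwise_absmax(W):
    """Row-wise absmax quantization: each row gets its own scale so the
    largest-magnitude entry in that row maps to +/-127.
    Assumes no row is all zeros."""
    scales = np.abs(W).max(axis=1, keepdims=True)  # one fp scale per row
    q = np.round(W / scales * 127).astype(np.int8)
    return q, scales

def dequantize_rowwise(q, scales):
    # inverse map back to float; error per element is at most
    # half a quantization step, i.e. 0.5 * scale / 127
    return q.astype(np.float32) * scales / 127

# usage
W = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_rowwise_absmax(W)
W_hat = dequantize_rowwise(q, s)
```

A zero-point (affine) scheme would additionally store a per-row offset so the int8 range is centered on the data's min/max rather than symmetric around zero.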
When you quantize a model, do you treat outliers separately as described in the LLM.int8() paper? If so, then where does this happen in the source code?
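To clarify what I mean by treating outliers separately: my understanding of the LLM.int8() decomposition, as a rough NumPy sketch (the `threshold` value and scaling axes follow the paper; this is not the actual bitsandbytes implementation, and `mixed_precision_matmul` is a name I made up):

```python
import numpy as np

def mixed_precision_matmul(X, W, threshold=6.0):
    """Sketch of an LLM.int8()-style decomposition: feature columns of X
    whose max magnitude exceeds `threshold` go through a high-precision
    matmul; the rest go through int8 with row-wise (X) and column-wise
    (W) absmax scales. Assumes at least one non-outlier column."""
    outlier_cols = np.abs(X).max(axis=0) > threshold

    # high-precision path for the outlier feature dimensions
    out_hp = X[:, outlier_cols] @ W[outlier_cols, :]

    # int8 path for the remaining dimensions
    Xs = X[:, ~outlier_cols]
    Ws = W[~outlier_cols, :]
    sx = np.abs(Xs).max(axis=1, keepdims=True) + 1e-12  # per-row scales
    sw = np.abs(Ws).max(axis=0, keepdims=True) + 1e-12  # per-column scales
    qx = np.round(Xs / sx * 127).astype(np.int8)
    qw = np.round(Ws / sw * 127).astype(np.int8)
    acc = qx.astype(np.int32) @ qw.astype(np.int32)  # int32 accumulation
    out_i8 = acc.astype(np.float32) * sx * sw / (127 * 127)

    return out_hp + out_i8
```

Is this roughly what happens at inference time, and if so, where does the outlier split live in the CUDA sources?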
This discussion was converted from issue #1161 on April 02, 2024 10:41.
`Linear8bitLt()`, which leads me to believe that quantization is happening in this line: https://github.com/TimDettmers/bitsandbytes/blob/main/csrc/kernels.cu#L2419. Could someone please confirm this? If not, where is the quantization occurring?