I had a couple of questions related to exactly how the quantization is done:
Does bitsandbytes do fp16 -> int8 quantization after transferring the tensors to the GPU? And if you want to dequantize, are those operations done on the GPU as well?
Is the quantization method absmax or zero-point, and is it applied row-wise? There is some mention of column-wise features, but when I load quantized models with Hugging Face, the scale factors seem to differ per row, not per column.
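For concreteness, by row-wise absmax I mean something like the following NumPy sketch (my own simplified illustration, not the actual bitsandbytes kernel):

```python
import numpy as np

def quantize_rowwise_absmax(W):
    """Row-wise absmax quantization: each row gets its own scale so the
    largest-magnitude entry in that row maps to +/-127.
    Assumes no row is all zeros."""
    scales = np.abs(W).max(axis=1, keepdims=True)  # one fp scale per row
    q = np.round(W / scales * 127).astype(np.int8)
    return q, scales

def dequantize_rowwise(q, scales):
    # inverse map back to float; error per element is at most
    # half a quantization step, i.e. 0.5 * scale / 127
    return q.astype(np.float32) * scales / 127

# usage
W = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_rowwise_absmax(W)
W_hat = dequantize_rowwise(q, s)
```

A zero-point (affine) scheme would additionally store a per-row offset so the int8 range is centered on the data's min/max rather than symmetric around zero.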
When you quantize a model, do you treat outliers separately as described in the LLM.int8() paper? If so, then where does this happen in the source code?
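To clarify what I mean by treating outliers separately: my understanding of the LLM.int8() decomposition, as a rough NumPy sketch (the `threshold` value and scaling axes follow the paper; this is not the actual bitsandbytes implementation, and `mixed_precision_matmul` is a name I made up):

```python
import numpy as np

def mixed_precision_matmul(X, W, threshold=6.0):
    """Sketch of an LLM.int8()-style decomposition: feature columns of X
    whose max magnitude exceeds `threshold` go through a high-precision
    matmul; the rest go through int8 with row-wise (X) and column-wise
    (W) absmax scales. Assumes at least one non-outlier column."""
    outlier_cols = np.abs(X).max(axis=0) > threshold

    # high-precision path for the outlier feature dimensions
    out_hp = X[:, outlier_cols] @ W[outlier_cols, :]

    # int8 path for the remaining dimensions
    Xs = X[:, ~outlier_cols]
    Ws = W[~outlier_cols, :]
    sx = np.abs(Xs).max(axis=1, keepdims=True) + 1e-12  # per-row scales
    sw = np.abs(Ws).max(axis=0, keepdims=True) + 1e-12  # per-column scales
    qx = np.round(Xs / sx * 127).astype(np.int8)
    qw = np.round(Ws / sw * 127).astype(np.int8)
    acc = qx.astype(np.int32) @ qw.astype(np.int32)  # int32 accumulation
    out_i8 = acc.astype(np.float32) * sx * sw / (127 * 127)

    return out_hp + out_i8
```

Is this roughly what happens at inference time, and if so, where does the outlier split live in the CUDA sources?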
This discussion was converted from issue #1161 on April 02, 2024 10:41.
`Linear8bitLt()`, which leads me to believe that quantization is happening in this line: https://github.com/TimDettmers/bitsandbytes/blob/main/csrc/kernels.cu#L2419. Could someone please confirm this? If not, where is the quantization occurring?