int4 not faster than fp16 and fp8 #2487
Comments
@ShuaiShao93 Which version of TensorRT-LLM do you use? Could you try the latest version?
Yes, it's the latest: 0.14.0.
Why do you only generate a single output token? With that, performance will be dominated by prefill compute, which is roughly the same for each quant mode, no? int4 might even be slower here due to the additional dequant work.
We use the LLM as a judge, so we only need it to answer yes or no. I think this is a common use case today and it's probably worth dedicated optimization.
Why is int4 not faster in prefill compute? I would expect both IO and compute to still be faster than fp16.
The int4 weights are first dequantized to fp16 before the actual computation runs. Only fp8 can avoid that step, and only with a specific config.
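To make the dequantize-then-compute step concrete, here is a minimal pure-Python sketch of weight-only int4 quantization. It is a conceptual illustration only, not TensorRT-LLM's kernel: the function names (`quantize_int4`, `dequantize_int4`, `weight_only_dot`) and the single per-tensor symmetric scale are assumptions made for this example.

```python
def quantize_int4(weights):
    """Symmetric int4 quantization with one scale; packs two values per byte."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the signed 4-bit range.
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    if len(q) % 2:
        q.append(0)  # pad so values pair up into whole bytes
    packed = bytes(
        (q[i] & 0xF) | ((q[i + 1] & 0xF) << 4) for i in range(0, len(q), 2)
    )
    return packed, scale


def dequantize_int4(packed, scale, count):
    """Unpack the nibbles and convert them back to floating point."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            signed = nibble - 16 if nibble >= 8 else nibble
            out.append(signed * scale)
    return out[:count]


def weight_only_dot(packed, scale, activations):
    """Dot product where the math runs on dequantized (floating-point) weights."""
    weights = dequantize_int4(packed, scale, len(activations))
    return sum(w * a for w, a in zip(weights, activations))


if __name__ == "__main__":
    w = [0.7, -0.35, 0.1, -0.7, 0.0]
    packed, scale = quantize_int4(w)
    print(len(packed), "bytes for", len(w), "weights")
    print(weight_only_dot(packed, scale, [1.0] * len(w)))
```

The storage shrinks, but the multiply-accumulate still runs at the higher precision, and the unpacking is extra work on top of it.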
Doesn't the L4 GPU have int4 tensor cores? Why do we have to dequant first?
fp8 is not faster than fp16 either.
Doing the math in int4 would not be accurate enough to produce useful results. They are trying something like it with fp4 on Blackwell, but it seems to have questionable quality there also.
I see, thanks for the explanation! Does this mean fp8 is likely to be the fastest option for now? If so, why is it still slower than fp16 in the OP?
@aikitoria is right. Weight-only quantization benefits the generation stage. A prefill-only comparison is unfair. Please benchmark with gptManagerBenchmark.
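The prefill-versus-generation argument can be sketched with a back-of-envelope roofline estimate. Everything numeric below is an assumed placeholder (parameter count, peak throughput, memory bandwidth), not a measured or official figure for the L4 or for the model in this issue. The model also ignores attention, the KV cache, activations, and dequantization overhead.

```python
def step_time(n_params, tokens, bytes_per_weight, peak_flops, mem_bw):
    """Estimated seconds for one forward pass: max of compute and weight-read time."""
    compute_s = 2.0 * n_params * tokens / peak_flops  # ~2 FLOPs per weight per token
    memory_s = n_params * bytes_per_weight / mem_bw   # every weight is read once
    return max(compute_s, memory_s)


# Illustrative placeholders only (assumptions, not hardware specs).
N_PARAMS = 8e9        # hypothetical model size
PEAK_FLOPS = 100e12   # hypothetical fp16 tensor throughput, FLOP/s
MEM_BW = 300e9        # hypothetical memory bandwidth, bytes/s

PREFILL_TOKENS = 2 * 2048  # 2 inputs of 2k tokens, as in this issue
DECODE_TOKENS = 2          # one new token per sequence

if __name__ == "__main__":
    for name, bytes_per_weight in (("fp16", 2.0), ("int4 weight-only", 0.5)):
        prefill = step_time(N_PARAMS, PREFILL_TOKENS, bytes_per_weight, PEAK_FLOPS, MEM_BW)
        decode = step_time(N_PARAMS, DECODE_TOKENS, bytes_per_weight, PEAK_FLOPS, MEM_BW)
        print(f"{name}: prefill ~{prefill:.4f} s, decode step ~{decode:.4f} s")
```

Under these assumptions prefill is compute-bound, so smaller weights do not change the estimate, while each decode step is bound by reading the weights, which is where weight-only quantization helps.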
Used bf16
fp8
int4
System Info
x86_64, Debian 11, L4 GPU
Who can help?
No response
Reproduction
Expected behavior
int4 faster than fp8 faster than bf16
Actual behavior
bf16
fp8
int4
Additional notes
batched_input.txt contains 2 inputs of 2k tokens each.