Wrong outputs with FP8 kv_cache reuse #2699
Labels: bug, Investigating, KV-Cache Management, triaged
System Info
GPU: 4090
TRTLLM version: 0.16.0
Docker image: nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3
Who can help?
@Tracin
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
1. Input contents
A CSV file containing 200 identical inputs.
test_input.csv
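For anyone who wants to build a similar input without the attachment, here is a minimal sketch. It assumes one text prompt per row, which may not match the actual layout of test_input.csv. The helper names, the output file name, and the prompt text are hypothetical.

```python
import csv


def write_repeated_prompts(path, prompt, count=200):
    """Write `count` identical rows, one prompt per row (assumed layout)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for _ in range(count):
            writer.writerow([prompt])


def read_prompts(path):
    """Read the prompts back, skipping empty rows."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row[0] for row in csv.reader(f) if row]
```

Usage: `write_repeated_prompts("repeated_prompts.csv", "your prompt here")` writes 200 copies of the prompt, and `read_prompts` should then return a list whose `set(...)` has exactly one element.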
2. Prepare trtllm engine
2.1 Download a model
2.2 Quantize and build engine
3. Run engine with run.py
3.1 Slightly modify run.py to see model output clearly
Modify the parse_input and print_output functions in TensorRT-LLM/examples/run.py
3.2 Run with kvcache reuse
3.3 Run without kvcache reuse
Expected behavior
Because the queries are identical, the beginning of each output should be the same. This is the correct output, obtained without KV cache reuse.
See the full correct outputs here:
kvcache_reuse_off.xlsx
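The expectation that every output starts the same way can be checked mechanically. The sketch below assumes the generated texts have already been loaded into a list of strings (for example, exported from the spreadsheet). The function names are hypothetical and not part of run.py.

```python
import os


def common_prefix_length(outputs):
    """Length in characters of the prefix shared by all outputs."""
    return len(os.path.commonprefix(list(outputs)))


def prefix_agreement(outputs, n_chars):
    """Fraction of outputs whose first n_chars match the first output."""
    if not outputs:
        return 0.0
    head = outputs[0][:n_chars]
    matches = sum(1 for text in outputs if text[:n_chars] == head)
    return matches / len(outputs)
```

For a healthy run with identical prompts, `prefix_agreement(outputs, n)` should stay at 1.0 for a reasonably large `n`. A value well below 1.0 points to the kind of divergence reported here.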
Actual behavior
With KV cache reuse enabled, the outputs differ from one another and are almost entirely wrong.
See the full wrong outputs here:
kvcache_reuse_on.xlsx
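To see which rows go wrong and how early, the two runs can be compared row by row. This is a sketch only: it assumes both result sets are available as lists of strings in the same row order, and the function names are hypothetical.

```python
def first_divergence(reference, candidate):
    """Index of the first differing character, or None if identical."""
    for i, (a, b) in enumerate(zip(reference, candidate)):
        if a != b:
            return i
    if len(reference) != len(candidate):
        # One string is a strict prefix of the other.
        return min(len(reference), len(candidate))
    return None


def divergence_report(reference_outputs, candidate_outputs):
    """List (row, position) pairs for every row where the runs differ."""
    report = []
    pairs = zip(reference_outputs, candidate_outputs)
    for row, (ref, cand) in enumerate(pairs):
        pos = first_divergence(ref, cand)
        if pos is not None:
            report.append((row, pos))
    return report
```

Passing the reuse-off outputs as `reference_outputs` and the reuse-on outputs as `candidate_outputs` gives the row indices that diverge, which may help show whether the first request is still correct and only later, cache-hitting requests are affected.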
Additional notes
FP16 KV cache reuse does not seem to have this problem.