
Wrong outputs with FP8 kv_cache reuse #2699

Open
2 of 4 tasks
lishicheng1996 opened this issue Jan 16, 2025 · 1 comment
Assignees
Labels
bug (Something isn't working), Investigating, KV-Cache Management, triaged (Issue has been triaged by maintainers)

Comments


lishicheng1996 commented Jan 16, 2025

System Info

GPU: 4090
TRTLLM version: 0.16.0
Docker image: nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

1. Input contents

A CSV file containing 200 identical inputs.

test_input.csv
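For reference, a file of this shape can be generated with a short script. Note that the prompt below is a placeholder, not the poster's actual input (that is in the attached test_input.csv), and the stock run.py may expect token IDs rather than text per row, depending on how parse_input was modified:

```python
import csv

# Build a CSV with 200 identical prompts, one per row.
# The prompt text is a placeholder; the issue's attached
# test_input.csv contains the actual inputs.
prompt = "What is the capital of France?"
with open("test_input.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for _ in range(200):
        writer.writerow([prompt])
```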

2. Prepare trtllm engine

2.1 Download a model

git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct

2.2 Quantize and build engine

python3 TensorRT-LLM/examples/quantization/quantize.py  \
                                      --model_dir  Qwen2.5-7B-Instruct \
                                      --output_dir Qwen2.5-7B-Instruct-ckpt \
                                      --qformat fp8 \
                                      --kv_cache_dtype fp8
trtllm-build --checkpoint_dir Qwen2.5-7B-Instruct-ckpt \
                    --output_dir Qwen2.5-7B-Instruct-engine  \
                    --gemm_plugin auto                   \
                    --use_paged_context_fmha enable       \
                    --use_fp8_context_fmha enable 

3. Run engine with run.py

3.1 Slightly modify run.py to see model output clearly

Modify the parse_input and print_output function in TensorRT-LLM/examples/run.py

[Screenshots of the modified parse_input and print_output functions]
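The screenshots of the edits are not preserved here. Purely as an illustration (not the poster's actual diff), a print_output-style change that writes each request's decoded text to the output CSV, so the 200 outputs are easy to compare side by side, might look like:

```python
import csv

# Illustrative only: the poster's real edits are in the screenshots above.
# Writes one row per request: (request index, decoded text).
def write_decoded_outputs(decoded_texts, output_csv):
    """decoded_texts: list of decoded output strings, one per request."""
    with open(output_csv, "w", newline="") as f:
        writer = csv.writer(f)
        for i, text in enumerate(decoded_texts):
            writer.writerow([i, text])
```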

3.2 Run with kvcache reuse

python3 TensorRT-LLM/examples/run.py  \
                     --input_file test_input.csv \
                     --output_csv output.csv \
                     --streaming \
                     --engine_dir Qwen2.5-7B-Instruct-engine  \
                     --tokenizer_dir  Qwen2.5-7B-Instruct \
                     --max_output_len 2048 \
                     --kv_cache_enable_block_reuse

3.3 Run without kvcache reuse

python3 TensorRT-LLM/examples/run.py  \
                     --input_file test_input.csv \
                     --output_csv output.csv \
                     --streaming \
                     --engine_dir Qwen2.5-7B-Instruct-engine  \
                     --tokenizer_dir  Qwen2.5-7B-Instruct \
                     --max_output_len 2048 

Expected behavior

Because the queries are all identical, the beginning of each output should be the same. This is the correct output without KV-cache reuse.

[Screenshot of the correct outputs]

See full correct outputs here

kvcache_reuse_off.xlsx
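Assuming greedy decoding, this invariant (identical prompts produce outputs that share a long common prefix) can be checked mechanically over an output CSV. The helper below is hypothetical, not part of run.py, and assumes the decoded text is in the last column of each row:

```python
import csv
from os.path import commonprefix

# Hypothetical check, not part of run.py: with identical prompts and
# greedy decoding, all generated rows should share a long common prefix.
def outputs_consistent(output_csv, min_prefix_len=32):
    with open(output_csv, newline="") as f:
        texts = [row[-1] for row in csv.reader(f) if row]
    return len(commonprefix(texts)) >= min_prefix_len
```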

actual behavior

With KV-cache reuse enabled, the outputs differ from one another and are almost entirely wrong.

[Screenshot of the wrong outputs]

See full wrong outputs here

kvcache_reuse_on.xlsx

additional notes

FP16 KV-cache reuse does not seem to have this problem.

@lishicheng1996 lishicheng1996 added the bug Something isn't working label Jan 16, 2025
@github-actions github-actions bot added triaged Issue has been triaged by maintainers Investigating labels Jan 21, 2025
@nv-guomingz
Collaborator

@lishicheng1996 thanks for reporting this issue. We'll take a look first.
