
Wrong outputs with FP8 kv_cache reuse #2699

Open
2 of 4 tasks
lishicheng1996 opened this issue Jan 16, 2025 · 1 comment
Assignees
Labels
bug (Something isn't working), Investigating, KV-Cache Management, triaged (Issue has been triaged by maintainers)

Comments


lishicheng1996 commented Jan 16, 2025

System Info

GPU: 4090
TRTLLM version: 0.16.0
Docker image: nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

1. Input contents

A CSV file containing 200 identical inputs.

test_input.csv
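For reference, a file of this shape can be generated with a short script. Note that the prompt below is a placeholder, not the poster's actual input (that is in the attached test_input.csv), and the stock run.py may expect token IDs rather than text per row, depending on how parse_input was modified:

```python
import csv

# Build a CSV with 200 identical prompts, one per row.
# The prompt text is a placeholder; the issue's attached
# test_input.csv contains the actual inputs.
prompt = "What is the capital of France?"
with open("test_input.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for _ in range(200):
        writer.writerow([prompt])
```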

2. Prepare trtllm engine

2.1 Download a model

git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct

2.2 Quantize and build engine

python3 TensorRT-LLM/examples/quantization/quantize.py  \
                                      --model_dir  Qwen2.5-7B-Instruct \
                                      --output_dir Qwen2.5-7B-Instruct-ckpt \
                                      --qformat fp8 \
                                      --kv_cache_dtype fp8
trtllm-build --checkpoint_dir Qwen2.5-7B-Instruct-ckpt \
                    --output_dir Qwen2.5-7B-Instruct-engine  \
                    --gemm_plugin auto                   \
                    --use_paged_context_fmha enable       \
                    --use_fp8_context_fmha enable 

3. Run engine with run.py

3.1 Slightly modify run.py to see model output clearly

Modify the parse_input and print_output function in TensorRT-LLM/examples/run.py

[Screenshots of the modified parse_input and print_output functions]
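The screenshots of the edits are not preserved here. Purely as an illustration (not the poster's actual diff), a print_output-style change that writes each request's decoded text to the output CSV, so the 200 outputs are easy to compare side by side, might look like:

```python
import csv

# Illustrative only: the poster's real edits are in the screenshots above.
# Writes one row per request: (request index, decoded text).
def write_decoded_outputs(decoded_texts, output_csv):
    """decoded_texts: list of decoded output strings, one per request."""
    with open(output_csv, "w", newline="") as f:
        writer = csv.writer(f)
        for i, text in enumerate(decoded_texts):
            writer.writerow([i, text])
```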

3.2 Run with kvcache reuse

python3 TensorRT-LLM/examples/run.py  \
                     --input_file test_input.csv \
                     --output_csv output.csv \
                     --streaming \
                     --engine_dir Qwen2.5-7B-Instruct-engine  \
                     --tokenizer_dir  Qwen2.5-7B-Instruct \
                     --max_output_len 2048 \
                     --kv_cache_enable_block_reuse

3.3 Run without kvcache reuse

python3 TensorRT-LLM/examples/run.py  \
                     --input_file test_input.csv \
                     --output_csv output.csv \
                     --streaming \
                     --engine_dir Qwen2.5-7B-Instruct-engine  \
                     --tokenizer_dir  Qwen2.5-7B-Instruct \
                     --max_output_len 2048 

Expected behavior

Because the queries are all identical, the beginning of each output should be the same. This is the correct output without KV-cache reuse.

[Screenshot of the correct outputs]

See full correct outputs here

kvcache_reuse_off.xlsx
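Assuming greedy decoding, this invariant (identical prompts produce outputs that share a long common prefix) can be checked mechanically over an output CSV. The helper below is hypothetical, not part of run.py, and assumes the decoded text is in the last column of each row:

```python
import csv
from os.path import commonprefix

# Hypothetical check, not part of run.py: with identical prompts and
# greedy decoding, all generated rows should share a long common prefix.
def outputs_consistent(output_csv, min_prefix_len=32):
    with open(output_csv, newline="") as f:
        texts = [row[-1] for row in csv.reader(f) if row]
    return len(commonprefix(texts)) >= min_prefix_len
```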

actual behavior

With KV-cache reuse enabled, the outputs differ from one another and are almost entirely wrong.

[Screenshot of the wrong outputs]

See full wrong outputs here

kvcache_reuse_on.xlsx

additional notes

FP16 KV-cache reuse does not seem to have this problem.

@lishicheng1996 lishicheng1996 added the bug Something isn't working label Jan 16, 2025
@github-actions github-actions bot added triaged Issue has been triaged by maintainers Investigating labels Jan 21, 2025
@nv-guomingz
Collaborator

@lishicheng1996 thanks for reporting this issue. We'll take a look first.
