I ran some experiments on KV cache reuse. With batch size = 1, engine inference latency decreases as the length of the common prefix increases. However, with batch size greater than 1, no matter how long the common prefix is, the latency with KV cache reuse enabled is always higher than with it disabled.
Here are the scripts to reproduce my results:
build engine
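A minimal sketch of the kind of trtllm-build invocation assumed here (checkpoint and output paths, precision, and the model-specific flags are placeholders, not the exact build script); the option that matters for this experiment is paged context FMHA, which KV cache block reuse requires at build time:

```bash
# Hypothetical build command; paths and precision are placeholders.
# --use_paged_context_fmha enable is required for KV cache block reuse.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_dir \
    --gemm_plugin float16 \
    --max_batch_size 7 \
    --max_input_len 384 \
    --max_output_len 1 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable
```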
engine inference

```python
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

# engine_dir, batch_size, and batches come from the rest of the benchmark script.
# Create the runner with KV cache block reuse enabled.
runner_kwargs = dict(engine_dir=engine_dir)
runner_kwargs.update(
    max_batch_size=batch_size,
    max_input_len=384,
    max_output_len=1,
)
runner_kwargs.update(kv_cache_enable_block_reuse=True)
runner = ModelRunnerCpp.from_dir(**runner_kwargs)

# Run one generation step per batch (max_new_tokens=1).
for batch in batches:
    batch_input_ids = [torch.IntTensor(inp) for inp in batch]
    outputs = runner.generate(
        batch_input_ids=batch_input_ids,
        max_new_tokens=1,
        return_dict=True)
```
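For reference, a minimal sketch of how the shared-prefix batches and the latency measurement could be set up (the token IDs, warm-up, and timing details below are illustrative assumptions, not the exact benchmark code):

```python
import time

import torch

# Reuses `runner` and `batch_size` from the script above.
# Hypothetical batch construction: every request in the batch starts with the
# same 128-token prefix and is filled out to 384 input tokens in total.
prefix_len, input_len = 128, 384
common_prefix = list(range(1, prefix_len + 1))
batch = [
    common_prefix + [1000 + i * input_len + j for j in range(input_len - prefix_len)]
    for i in range(batch_size)
]

# Warm-up call so one-time initialization overhead is not counted.
runner.generate(batch_input_ids=[torch.IntTensor(inp) for inp in batch],
                max_new_tokens=1, return_dict=True)

# Latency is measured as the wall-clock time of a single generate() call.
start = time.perf_counter()
outputs = runner.generate(batch_input_ids=[torch.IntTensor(inp) for inp in batch],
                          max_new_tokens=1, return_dict=True)
latency_ms = (time.perf_counter() - start) * 1e3
print(f"batch_size={batch_size}, latency={latency_ms:.2f} ms")
```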
Result on A100, max input length = 384, max output length = 1, common prefix = 128 tokens:

| Setting | Latency |
| --- | --- |
| bs=1, KV cache reuse enabled | 5.03 ms |
| bs=1, KV cache reuse disabled | 5.13 ms |
| bs=7, KV cache reuse enabled | 33.16 ms |
| bs=7, KV cache reuse disabled | 14.72 ms |
Result on A100, max input length = 384, max output length = 1, common prefix = 256 tokens:

| Setting | Latency |
| --- | --- |
| bs=1, KV cache reuse enabled | 4.56 ms |
| bs=1, KV cache reuse disabled | 5.13 ms |
| bs=7, KV cache reuse enabled | 30.29 ms |
| bs=7, KV cache reuse disabled | 14.72 ms |
In addition, when I set the common prefix length equal to the input length (meaning the same request is used for every inference), the latency remains at 4 ms. Is this expected?