I ran some experiments on KV cache reuse. With batch size = 1, engine inference latency decreases as the length of the common prefix increases. However, with batch size greater than 1, no matter how long the common prefix is, the latency with KV cache reuse enabled is always higher than with it disabled.
Here are the scripts to reproduce my results:
build engine
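A minimal sketch of the kind of trtllm-build invocation assumed here (checkpoint and output paths, precision, and the model-specific flags are placeholders, not the exact build script); the option that matters for this experiment is paged context FMHA, which KV cache block reuse requires at build time:

```bash
# Hypothetical build command; paths and precision are placeholders.
# --use_paged_context_fmha enable is required for KV cache block reuse.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./engine_dir \
    --gemm_plugin float16 \
    --max_batch_size 7 \
    --max_input_len 384 \
    --max_output_len 1 \
    --paged_kv_cache enable \
    --use_paged_context_fmha enable
```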
engine inference

```python
import torch
from tensorrt_llm.runtime import ModelRunnerCpp

# engine_dir, batch_size, and batches come from the rest of the benchmark script.
# Create the runner with KV cache block reuse enabled.
runner_kwargs = dict(engine_dir=engine_dir)
runner_kwargs.update(
    max_batch_size=batch_size,
    max_input_len=384,
    max_output_len=1,
)
runner_kwargs.update(kv_cache_enable_block_reuse=True)
runner = ModelRunnerCpp.from_dir(**runner_kwargs)

# Run one generation step per batch (max_new_tokens=1).
for batch in batches:
    batch_input_ids = [torch.IntTensor(inp) for inp in batch]
    outputs = runner.generate(
        batch_input_ids=batch_input_ids,
        max_new_tokens=1,
        return_dict=True)
```
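For reference, a minimal sketch of how the shared-prefix batches and the latency measurement could be set up (the token IDs, warm-up, and timing details below are illustrative assumptions, not the exact benchmark code):

```python
import time

import torch

# Reuses `runner` and `batch_size` from the script above.
# Hypothetical batch construction: every request in the batch starts with the
# same 128-token prefix and is filled out to 384 input tokens in total.
prefix_len, input_len = 128, 384
common_prefix = list(range(1, prefix_len + 1))
batch = [
    common_prefix + [1000 + i * input_len + j for j in range(input_len - prefix_len)]
    for i in range(batch_size)
]

# Warm-up call so one-time initialization overhead is not counted.
runner.generate(batch_input_ids=[torch.IntTensor(inp) for inp in batch],
                max_new_tokens=1, return_dict=True)

# Latency is measured as the wall-clock time of a single generate() call.
start = time.perf_counter()
outputs = runner.generate(batch_input_ids=[torch.IntTensor(inp) for inp in batch],
                          max_new_tokens=1, return_dict=True)
latency_ms = (time.perf_counter() - start) * 1e3
print(f"batch_size={batch_size}, latency={latency_ms:.2f} ms")
```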
Result on A100, max input length = 384, max output length = 1, common prefix = 128 tokens:

| Setting | Latency |
| --- | --- |
| bs=1, KV cache reuse enabled | 5.03 ms |
| bs=1, KV cache reuse disabled | 5.13 ms |
| bs=7, KV cache reuse enabled | 33.16 ms |
| bs=7, KV cache reuse disabled | 14.72 ms |
Result on A100, max input length = 384, max output length = 1, common prefix = 256 tokens:

| Setting | Latency |
| --- | --- |
| bs=1, KV cache reuse enabled | 4.56 ms |
| bs=1, KV cache reuse disabled | 5.13 ms |
| bs=7, KV cache reuse enabled | 30.29 ms |
| bs=7, KV cache reuse disabled | 14.72 ms |
In addition, when I set the common prefix length equal to the input length (meaning the same request is used for every inference), the latency remains at 4 ms. Is this expected?