Reserve KV cache capacity after the first model run
Hugging Face models with separate branches for the first and subsequent iterations do not use the input KV cache buffer on the first run. As a result, they did not benefit from the pre-allocated capacity and re-allocated a new KV cache buffer on every run. To resolve this, change the KV cache growth strategy to grow the buffer after the model runs, if the capacity limit has been reached. Also replace the hard-coded capacity with a growth strategy that doubles the capacity each time it is exhausted. This amortizes the cost of copying the old KV cache into the new buffer.
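The growth strategy described above can be illustrated with a minimal sketch. This is not the code from this commit: the `KVCache` class, its `reserve_after_run` method, and the `simulate` helper are hypothetical names, and a plain Python list stands in for the real tensor buffer. It only shows the two ideas in the message: capacity is checked after each run, and it doubles when exhausted.

```python
class KVCache:
    """Toy KV cache holding one entry per sequence position (hypothetical)."""

    def __init__(self, initial_capacity=4):
        self.capacity = initial_capacity
        self.buffer = [None] * initial_capacity
        self.length = 0
        self.copied = 0  # total entries copied across all growth steps

    def append(self, entry):
        # Stand-in for a model run writing one new position into the cache.
        assert self.length < self.capacity, "capacity must be reserved first"
        self.buffer[self.length] = entry
        self.length += 1

    def reserve_after_run(self):
        # Called after a model run: grow only if the buffer is now full.
        if self.length < self.capacity:
            return
        new_capacity = self.capacity * 2  # doubling amortizes the copies
        new_buffer = [None] * new_capacity
        new_buffer[: self.length] = self.buffer[: self.length]
        self.copied += self.length
        self.buffer = new_buffer
        self.capacity = new_capacity


def simulate(steps, initial_capacity=4):
    """Run `steps` single-token iterations and return the resulting cache."""
    cache = KVCache(initial_capacity)
    for step in range(steps):
        cache.append(step)
        cache.reserve_after_run()
    return cache


if __name__ == "__main__":
    cache = simulate(100)
    print(cache.length, cache.capacity, cache.copied)
```

In this sketch the total number of copied entries stays below twice the final sequence length, whereas re-allocating on every run would copy the whole cache each step.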
1 parent 773f728, commit c24e15f
Showing 1 changed file with 99 additions and 14 deletions.