1.19.0 fast-forward merge #542

Merged: 26 commits into v1.19.0 on Nov 26, 2024
Conversation

kzawora-intel

No description provided.

mfylcek and others added 22 commits November 15, 2024 14:56
Random sampler warmup
This change makes it possible to skip empty steps in the multi-step scenario. We are currently wasting host time launching n-2 empty steps; this PR removes them. The gain will become visible after device-time optimizations, as we are currently limited by HPU calculations inside multi-step (see the sketch below).
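
A minimal sketch of the skip, assuming a hypothetical `Step` object with `has_work()` and `launch()` methods; these names are illustrative, not the actual vllm-fork multi-step classes:

```python
# Sketch: skip empty multi-step iterations instead of launching them.
# Step, has_work(), and launch() are illustrative stand-ins.
from dataclasses import dataclass, field

@dataclass
class Step:
    requests: list = field(default_factory=list)  # work queued for this step

    def has_work(self) -> bool:
        return bool(self.requests)

    def launch(self) -> None:
        print(f"launching step with {len(self.requests)} requests")

def execute_multistep(steps: list[Step]) -> None:
    for step in steps:
        # Previously every step was launched, so the n-2 empty trailing
        # steps still cost host-side launch time; skipping them removes
        # that overhead without changing the device work.
        if not step.has_work():
            continue
        step.launch()
```
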
Limit decode bucket size to num_hpu_blocks
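
The intent of the bucket cap above, as a sketch under assumptions (the helper name and candidate list are illustrative, not the actual bucketing code): a decode bucket sized beyond the available HPU KV-cache blocks can never be hit, so such candidates are dropped.

```python
# Illustrative sketch: drop decode-bucket candidates that exceed the
# number of available HPU KV-cache blocks (num_hpu_blocks), since no
# real batch can ever occupy more blocks than exist.
def limit_decode_buckets(candidates: list[int], num_hpu_blocks: int) -> list[int]:
    return sorted(b for b in candidates if b <= num_hpu_blocks)

# Example: with 1024 HPU blocks, the 2048 and 4096 buckets are dropped.
assert limit_decode_buckets([128, 256, 512, 1024, 2048, 4096], 1024) == \
    [128, 256, 512, 1024]
```
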
Fixes an issue with multi-LoRA during `profile_run`.
We are seeing a 10% performance regression in Llama-based models due to
vllm-project#10239. The mark_step()
function needs to be configured differently per model to achieve the
best performance: for some models, calling mark_step() after every decoder
step is optimal, while for others it is better to call it every
n-th step. We add a counter so that the hook is only registered for every
n-th step, configurable via VLLM_CONFIG_HIDDEN_LAYERS (see the sketch below).
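
A minimal sketch of the counter idea, assuming a simplified forward hook; the class name and the `mark_step()` stub are illustrative, and only VLLM_CONFIG_HIDDEN_LAYERS comes from the commit:

```python
import os

def mark_step() -> None:
    # Stand-in for the HPU graph-break call (htorch.core.mark_step() in
    # habana_frameworks.torch); a no-op here to keep the sketch runnable.
    pass

class EveryNthLayerMarkStep:
    """Counter-based hook: call mark_step() only every n-th decoder layer."""

    def __init__(self) -> None:
        # n = 1 reproduces the old behavior (mark_step() on every layer).
        self.interval = int(os.environ.get("VLLM_CONFIG_HIDDEN_LAYERS", "1"))
        self.counter = 0

    def __call__(self, module, inputs, output):
        # Matches the torch.nn.Module forward-hook signature.
        self.counter += 1
        if self.counter % self.interval == 0:
            mark_step()
        return output
```

Registered on each decoder layer with `layer.register_forward_hook(...)`, this inserts a graph break every n layers instead of after every layer.
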
@michalkuligowski merged commit 79e37ad into v1.19.0 on Nov 26, 2024
16 checks passed