Dynamically load LoRA weights when using vLLM #2730
This PR implements the proposed improvement from #2725 and dynamically loads LoRA adapters into vLLM instead of merging LoRA weights back into the base model at each step. This will in practice be much faster and less memory intensive than merging.
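For context, this is roughly what the dynamic-loading path looks like with vLLM's LoRA support (not the PR's code, just a minimal sketch of the API it relies on; the model name, adapter path, and `"trl_adapter"` label are illustrative):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Base model only; the adapter is never merged back into the weights.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# At each step, point vLLM at the freshly saved adapter checkpoint instead of
# merging the LoRA weights into the base model.
lora_path = "outputs/checkpoint-latest/adapter"  # illustrative path
outputs = llm.generate(
    ["Write a haiku about reinforcement learning."],
    sampling_params,
    # A fresh lora_int_id per checkpoint is typically needed so vLLM reloads
    # the adapter rather than serving a cached copy.
    lora_request=LoRARequest("trl_adapter", 1, lora_path),
)
print(outputs[0].outputs[0].text)
```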
The only caveat I would flag is that vLLM does appear to leak host memory when LoRA adapters are dynamically loaded over and over. LoRAs are small, so this isn't necessarily going to cause failures, but the safest workaround we've found when testing this internally has been to periodically recreate the `LLM` instance every 50-100 steps (in my experience this can safely be done in the same part of the code that writes out the LoRA checkpoint). Would be good to file an issue with the vLLM team so they can investigate this at some point (@Jeffwan is this something you've encountered in your work on LoRA in vLLM?).

This is an adaptation of some code my team is using on a fork of TRL, so it would be great if someone like @qgallouedec would be willing to commandeer and test it further to ensure everything is working as intended.
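For reference, a minimal sketch of the periodic-recreation workaround described above, assuming a training loop that already writes the LoRA checkpoint each step; `build_llm`, `num_steps`, and `recreate_interval` are illustrative names, not part of this PR:

```python
import gc

import torch
from vllm import LLM


def build_llm() -> LLM:
    # Same construction arguments used for the original engine.
    return LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)


llm = build_llm()
num_steps = 300          # illustrative
recreate_interval = 50   # the 50-100 step range has been safe in our testing

for step in range(1, num_steps + 1):
    # ... run the training step and write the LoRA checkpoint here ...

    if step % recreate_interval == 0:
        # Drop the old engine and rebuild it to release the leaked host memory.
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        llm = build_llm()
```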