Dynamically load LoRA weights when using vLLM #2730
This PR implements the proposed improvement from #2725 and dynamically loads LoRA adapters into vLLM instead of merging LoRA weights back into the base model at each step. This will in practice be much faster and less memory intensive than merging.
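For context, this is roughly what the dynamic-loading path looks like with vLLM's LoRA support (not the PR's code, just a minimal sketch of the API it relies on; the model name, adapter path, and `"trl_adapter"` label are illustrative):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Base model only; the adapter is never merged back into the weights.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

# At each step, point vLLM at the freshly saved adapter checkpoint instead of
# merging the LoRA weights into the base model.
lora_path = "outputs/checkpoint-latest/adapter"  # illustrative path
outputs = llm.generate(
    ["Write a haiku about reinforcement learning."],
    sampling_params,
    # A fresh lora_int_id per checkpoint is typically needed so vLLM reloads
    # the adapter rather than serving a cached copy.
    lora_request=LoRARequest("trl_adapter", 1, lora_path),
)
print(outputs[0].outputs[0].text)
```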
The only caveat I would flag is that vLLM does appear to leak host memory when LoRA adapters are dynamically loaded over and over. LoRAs are small, so this isn't necessarily going to cause failures, but the safest workaround we've found when testing this internally has been to periodically recreate the `LLM` instance every 50-100 steps (in my experience this can safely be done in the same part of the code that writes out the LoRA checkpoint). Would be good to file an issue with the vLLM team so they can investigate this at some point (@Jeffwan is this something you've encountered in your work on LoRA in vLLM?).

This is an adaptation of some code my team is using on a fork of TRL, so it would be great if someone like @qgallouedec would be willing to commandeer and test it further to ensure everything is working as intended.
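For reference, a minimal sketch of the periodic-recreation workaround described above, assuming a training loop that already writes the LoRA checkpoint each step; `build_llm`, `num_steps`, and `recreate_interval` are illustrative names, not part of this PR:

```python
import gc

import torch
from vllm import LLM


def build_llm() -> LLM:
    # Same construction arguments used for the original engine.
    return LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)


llm = build_llm()
num_steps = 300          # illustrative
recreate_interval = 50   # the 50-100 step range has been safe in our testing

for step in range(1, num_steps + 1):
    # ... run the training step and write the LoRA checkpoint here ...

    if step % recreate_interval == 0:
        # Drop the old engine and rebuild it to release the leaked host memory.
        del llm
        gc.collect()
        torch.cuda.empty_cache()
        llm = build_llm()
```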