
[test] Add tests for SPMD vLLM #116

Closed · wants to merge 15 commits
Conversation


ZSL98 commented Jan 18, 2025

This PR, which adds tests for SPMD vLLM, is at an early stage and not ready to merge.
I opened it to confirm that main-branch vLLM now works with verl's test case (run_fsdp_vllm.py) and demonstrates compatibility. Below are baseline comparisons of weight sync duration.

Configuration: 8×L20 GPUs, TP=4

A. Across-process (broadcast/gloo): run `python test_sync_weight_openrlhf.py` with `vllm_sync_backend = "gloo"`; rank 7 broadcasts weights to ranks 0-3 over the gloo backend (a sketch of this broadcast follows the list).
B. Across-process (broadcast/nccl): same script with `vllm_sync_backend = "nccl"`; rank 7 broadcasts weights to ranks 0-3 over the nccl backend.
C. FSDP+vLLM: the original test case; run 4 workers with `torchrun --nproc-per-node=4 run_fsdp_vllm.py`.
D. FSDP+vLLM (SPMD): using vllm='0.6.6.post2.dev252+g8027a724'; run 4 workers with `torchrun --nproc-per-node=4 run_fsdp_vllm_spmd.py`.
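For reference, here is a minimal sketch of the cross-process broadcast measured in setups A and B. The function name, rank layout, and structure are assumptions for illustration; the actual logic in test_sync_weight_openrlhf.py may differ.

```python
import torch
import torch.distributed as dist

def broadcast_weights(model: torch.nn.Module, backend: str) -> None:
    """Broadcast all parameters from the trainer rank to the vLLM ranks.

    Illustrative sketch: assumes dist.init_process_group() was already
    called and that only ranks 0-3 (vLLM) and rank 7 (trainer) enter here.
    """
    group = dist.new_group(ranks=[0, 1, 2, 3, 7], backend=backend)
    src = 7  # trainer rank that owns the up-to-date weights
    for _, param in model.named_parameters():
        # gloo broadcasts CPU tensors; nccl requires CUDA tensors.
        buf = param.data.cpu() if backend == "gloo" else param.data.cuda()
        dist.broadcast(buf, src=src, group=group)
        if dist.get_rank() != src:
            param.data.copy_(buf)
```

The per-parameter loop is the simplest form; bucketing tensors into larger buffers before broadcasting would cut per-call overhead, which likely matters most for the gloo numbers below.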

The measured weight sync times (in seconds):

| Model | Across Process (gloo) | Across Process (nccl) | FSDP+vLLM | FSDP+vLLM (SPMD) |
|---|---|---|---|---|
| Qwen2.5-3B-Instruct | 4.122498 | 1.020927 | 0.236113 | 0.232854 |
| Qwen2.5-7B-Instruct | 13.538473 | 0.789263 | 0.546817 | 0.548971 |
| Meta-Llama-3-8B-Instruct | 12.211754 | 0.700919 | 0.569790 | 0.568038 |

Note that the across-process times include only the broadcast step (a complete weight sync would also include gathering the weights). FSDP+vLLM and FSDP+vLLM (SPMD) should perform identically, since the weight sync logic is unchanged (this colocated gather-and-load path is sketched after the list). Based on these results:

  1. With SPMD vLLM now available, there is no need to explore cross-process weight synchronization methods further.
  2. Next, I will replace the original vLLM with the new vLLM in the core logic to verify verl's compatibility.
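For completeness, a hedged sketch of the colocated sync that setups C and D time. `fsdp_model` and `llm` are placeholder names, and the attribute path down to vLLM's underlying model object varies between vLLM versions, so treat this as an illustration rather than run_fsdp_vllm.py's exact code.

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType

def sync_fsdp_to_vllm(fsdp_model: FSDP, llm) -> None:
    # Gather the unsharded weights on every rank; this is the "weight
    # gathering" step that the across-process timings above leave out.
    cfg = FullStateDictConfig(offload_to_cpu=False, rank0_only=False)
    with FSDP.state_dict_type(fsdp_model, StateDictType.FULL_STATE_DICT, cfg):
        state_dict = fsdp_model.state_dict()
    # vLLM's load_weights consumes an iterable of (name, tensor) pairs and
    # re-shards each weight for the local TP rank. The attribute path below
    # is version-dependent.
    model = llm.llm_engine.model_executor.driver_worker.model_runner.model
    model.load_weights(state_dict.items())
```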

ZSL98 (Author) commented Feb 5, 2025

Moved to #209

@ZSL98 closed this Feb 5, 2025