[test] Add tests for SPMD vLLM #116

ZSL98 · 2025-01-18T11:16:01Z

This PR, which tests SPMD vLLM, is at an early stage and is not ready to merge.
For now its purpose is to confirm that main-branch vLLM works with verl's test case (run_fsdp_vllm.py). Below are some baseline comparisons of weight sync duration.

Configuration: 8*L20 GPUs, TP=4
A. Across Process(broadcast/gloo): run python test_sync_weight_openrlhf.py with vllm_sync_backend = "gloo"; rank 7 broadcasts weights to ranks 0-3 over the gloo backend.
B. Across Process(broadcast/nccl): run python test_sync_weight_openrlhf.py with vllm_sync_backend = "nccl"; rank 7 broadcasts weights to ranks 0-3 over the nccl backend.
C. FSDP+vLLM: the original test case; run 4 workers with torchrun --nproc-per-node=4 run_fsdp_vllm.py
D. FSDP+vLLM(spmd): using vllm='0.6.6.post2.dev252+g8027a724'; run 4 workers with torchrun --nproc-per-node=4 run_fsdp_vllm_spmd.py
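
For reference, a minimal sketch of how a single weight sync could be timed. This is not the profiling code from this PR: `time_sync`, `sync_weights`, and `device_sync` are hypothetical names used only for illustration. The optional `device_sync` hook is there because GPU work is asynchronous; in a CUDA run one would typically pass `torch.cuda.synchronize` so the timer covers the whole transfer.

```python
import time
from typing import Callable, Optional


def time_sync(sync_weights: Callable[[], None],
              device_sync: Optional[Callable[[], None]] = None) -> float:
    """Return the wall-clock seconds taken by one call to sync_weights.

    sync_weights and device_sync are placeholders (assumptions), e.g. a
    broadcast routine and torch.cuda.synchronize in a real GPU run.
    """
    if device_sync is not None:
        device_sync()  # drain earlier async work so it is not billed to the sync
    start = time.perf_counter()
    sync_weights()
    if device_sync is not None:
        device_sync()  # wait until the sync's own async work has finished
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workload; a real run would call the weight broadcast here.
    elapsed = time_sync(lambda: time.sleep(0.01))
    print(f"weight sync took {elapsed:.6f} s")
```

Without the synchronization calls, the timer may stop before the device has finished, which would make asynchronous backends look faster than they are.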

The measured weight sync times (in seconds) are:

| Model | Across Process(gloo) | Across Process(nccl) | FSDP+vLLM | FSDP+vLLM(spmd) |
| --- | --- | --- | --- | --- |
| Qwen2.5-3B-Instruct | 4.122498 | 1.020927 | 0.236113 | 0.232854 |
| Qwen2.5-7B-Instruct | 13.538473 | 0.789263 | 0.546817 | 0.548971 |
| Meta-Llama-3-8B-Instruct | 12.211754 | 0.700919 | 0.569790 | 0.568038 |

Note that the across-process timings cover only the broadcast step; a complete weight sync would also include weight gathering, so these figures understate the full cost. FSDP+vLLM and FSDP+vLLM(spmd) are expected to perform identically because the weight sync logic is unchanged. From these results I draw two conclusions:

1. With SPMD vLLM now available, there's no need to explore cross-process weight synchronization methods.
2. I will test replacing the original vLLM with the new vLLM in the core logic to verify verl's compatibility.
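
To make the comparison easier to read, the reported times can be expressed as ratios against the FSDP+vLLM(spmd) column. The snippet below is only an illustrative sketch: the numbers are copied from the table above, while the dictionary keys and the helper `ratio_vs_spmd` are names made up for this example.

```python
# Weight sync times in seconds, copied from the table in this PR description.
# The short keys (gloo, nccl, fsdp_vllm, fsdp_vllm_spmd) are illustrative labels.
SYNC_SECONDS = {
    "Qwen2.5-3B-Instruct": {
        "gloo": 4.122498, "nccl": 1.020927,
        "fsdp_vllm": 0.236113, "fsdp_vllm_spmd": 0.232854,
    },
    "Qwen2.5-7B-Instruct": {
        "gloo": 13.538473, "nccl": 0.789263,
        "fsdp_vllm": 0.546817, "fsdp_vllm_spmd": 0.548971,
    },
    "Meta-Llama-3-8B-Instruct": {
        "gloo": 12.211754, "nccl": 0.700919,
        "fsdp_vllm": 0.569790, "fsdp_vllm_spmd": 0.568038,
    },
}


def ratio_vs_spmd(model: str, setup: str) -> float:
    """Return how many times longer `setup` took than FSDP+vLLM(spmd)."""
    row = SYNC_SECONDS[model]
    return row[setup] / row["fsdp_vllm_spmd"]


if __name__ == "__main__":
    for model in SYNC_SECONDS:
        parts = ", ".join(
            f"{setup}: {ratio_vs_spmd(model, setup):.2f}x"
            for setup in ("gloo", "nccl", "fsdp_vllm")
        )
        print(f"{model}: {parts}")
```

In every row the two FSDP+vLLM columns are within 2% of each other, while the nccl broadcast alone takes at least 1.2 times as long as the full SPMD sync, even though the across-process figures exclude weight gathering.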

ZSL98 · 2025-02-05T17:11:10Z

Moved to #209

zhangshulai and others added 12 commits January 17, 2025 16:24

[test] test for vllm-spmd

be4cd50

[test] test for sync weight in OpenRLHF style

d76c04d

[chore] Remove dependencies on vllm<=0.6.3

e20ba1b

[test] Add time profiling on vllm sync weight

ac4c91d

[test] Some formatting changes

6fb4999

Merge branch 'volcengine:main' into zsl/vllm-spmd

b64a473

Merge branch 'volcengine:main' into zsl/vllm-spmd

4fe511a

Merge branch 'volcengine:main' into zsl/vllm-spmd

c77bbec

Add a tiny version of run_qwen2-7b_seq_balance.sh

234a52d

init some files

bc689b6

Merge remote-tracking branch 'upstream/main' into zsl/vllm-spmd

6f55342

update

5a2d526

PeterSH6 mentioned this pull request Feb 1, 2025

[Question] Is vLLMRollout.generate_sequences the right place to implement tool calling? #176

Open

zhangshulai added 3 commits February 3, 2025 20:59

update

c0a5099

update

4a6d686

support fsdp

6c78554

ZSL98 closed this Feb 5, 2025
