
Please give a repro on ray cluster setup? #9

Open
kouroshHakha opened this issue Jan 26, 2025 · 4 comments

Comments

@kouroshHakha

I read the code, and it seems the current setup uses STRICT_SPREAD placement, which forces the processes to be placed on different nodes (not something that works with one or two H100 boxes).

Also, after changing STRICT_SPREAD to SPREAD, I think the default provided shell script requires 8x H100 for the actor + ref model allocation (assuming colocation), 8 more for the critic, and 16 more for the vLLM actors. So a total of 32x H100s seems to be needed for this run. Is this correct?
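
For reference, this is roughly what the two strategies mean in Ray's placement-group API (a minimal sketch using Ray's public API; the 1-GPU bundle shapes are illustrative, not the bundles OpenRLHF actually builds):

import ray
from ray.util.placement_group import placement_group

ray.init()

# STRICT_SPREAD requires every bundle to land on a different node,
# so four 1-GPU bundles can never be scheduled on a single machine.
strict_pg = placement_group([{"GPU": 1}] * 4, strategy="STRICT_SPREAD")

# SPREAD only prefers distinct nodes; the same four bundles will pack
# onto one node with 4+ GPUs if that is all the cluster offers.
spread_pg = placement_group([{"GPU": 1}] * 4, strategy="SPREAD")

ray.get(spread_pg.ready())  # resolves once all bundles are placed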

@HYZ17
Collaborator

HYZ17 commented Jan 27, 2025

Hi, thank you for your question! We utilize 4 nodes to accelerate our experiments. You can adjust the allocation based on your needs by modifying the ref/actor/critic_num_gpus_per_node and vllm_num_engines settings. For instance, you can refer to this single-node configuration from the official OpenRLHF repository. As for the minimum requirements, I believe it would be 8x A100 or H100 GPUs.

@ctjlewis

@HYZ17 4 nodes, each with 8x 80GB GPUs? Did you use A100s or H100s? Any details on the hardware would be appreciated for reproducing this.

@HYZ17
Collaborator

HYZ17 commented Jan 27, 2025

We use 4 nodes, each with 8x H100-80G GPUs, to train on 8K MATH examples for 2 days. The minimum hardware requirement is 6x H100/A100-80G GPUs (we haven't tested this yet).

For more details, please refer to here. Thank you!

@kouroshHakha
Author

kouroshHakha commented Jan 27, 2025

I ended up using the following setup for 1 node on Qwen-1.5B with ZeRO-3 for training. It's slow, but it runs. I also had to change all the STRICT_SPREAD strategies to SPREAD (though I don't think this part is necessary, since num_gpus_per_node is 1 for the distributed actors).

python3 openrlhf/cli/train_ppo_ray_box.py \
    --ref_num_nodes 1 \
    --ref_num_gpus_per_node 1 \
    --reward_num_nodes 0 \
    --reward_num_gpus_per_node 0 \
    --critic_num_nodes 1 \
    --critic_num_gpus_per_node 1 \
    --actor_num_nodes 1 \
    --actor_num_gpus_per_node 1 \
    --vllm_num_engines 4 \
    --vllm_tensor_parallel_size 1 \
    --colocate_actor_ref \
    --pretrain Qwen/Qwen2.5-Math-1.5B \
    ...
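
For anyone making the same change: it amounts to swapping the strategy string wherever the trainer builds its placement groups. A rough sketch of what that looks like (the bundle construction and variable names below are illustrative, not the exact OpenRLHF source):

from ray.util.placement_group import placement_group

num_total_gpus = 8  # hypothetical: one 8-GPU node

bundles = [{"GPU": 1, "CPU": 1} for _ in range(num_total_gpus)]

# Before: every bundle must sit on a separate node, which can never
# be satisfied on a single box and leaves scheduling stuck.
# pg = placement_group(bundles, strategy="STRICT_SPREAD")

# After: bundles may share a node when the cluster is small.
pg = placement_group(bundles, strategy="SPREAD")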
