Please give a repro on ray cluster setup? #9
Hi, thank you for your question! We use 4 nodes to accelerate our experiments. You can adjust the allocation based on your needs by modifying the ref/actor/critic_num_gpus_per_node and vllm_num_engines settings. For instance, you can refer to this single-node configuration from the official OpenRLHF repository. As for the minimum requirements, I believe it would be 8x A100 or H100 GPUs.
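For reference, here is a minimal sketch of how those flags can be combined for a single 8-GPU node with OpenRLHF's Ray PPO entry point. The 2/2/2 GPU split and the 2 vLLM engines are illustrative assumptions, not the configuration used in this repo; check the flag names against the OpenRLHF version you run.

```python
# Minimal single-node (8 GPU) allocation sketch for OpenRLHF's Ray PPO
# entry point. The 2/2/2 split and 2 vLLM engines are assumptions for
# illustration only; model/data/training flags are omitted.
import subprocess

cmd = [
    "python", "-m", "openrlhf.cli.train_ppo_ray",
    "--ref_num_nodes", "1", "--ref_num_gpus_per_node", "2",
    "--actor_num_nodes", "1", "--actor_num_gpus_per_node", "2",
    "--critic_num_nodes", "1", "--critic_num_gpus_per_node", "2",
    "--vllm_num_engines", "2", "--vllm_tensor_parallel_size", "1",
    # ...model, data, and training flags omitted...
]
subprocess.run(cmd, check=True)
```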
@HYZ17 4 nodes, each with 8x 80GB GPUs? Did you use A100s or H100s? Any hardware details would be appreciated to help reproduce.
We use 4 nodes, each with 8 H100-80G GPUs, to train on 8K MATH examples for 2 days. The minimum hardware requirement is 6 H100/A100-80G GPUs (we haven't tested this yet). For more details, please refer to here. Thank you!
I ended up using the following setup for 1 node on Qwen-1.5B with ZeRO-3 for training. It's slow, but it runs. I also had to change all the STRICT_SPREAD strategies to SPREAD (I don't think this part is necessary, since num_gpus_per_node is 1 for the distributed actors).
I read the code; it seems the current setup uses STRICT_SPREAD placement, which forces the processes to be placed on different nodes (not something that works with 1 or 2 H100 boxes).
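For context, here is a minimal standalone sketch of the difference using Ray placement groups (generic Ray API usage, not the actual code from this repo; the bundle sizes are illustrative):

```python
import ray
from ray.util.placement_group import placement_group

ray.init()

# STRICT_SPREAD requires every bundle to land on a *different* node, so
# four 1-GPU bundles can never be scheduled on a single 8-GPU box.
# SPREAD only prefers spreading bundles across nodes and falls back to
# packing them onto whatever nodes are available.
pg = placement_group(
    [{"GPU": 1, "CPU": 4} for _ in range(4)],
    strategy="SPREAD",  # was STRICT_SPREAD in the original setup
)
ray.get(pg.ready())  # resolves once all bundles have been reserved
```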
Also, after changing STRICT_SPREAD to SPREAD, I think the default provided shell script requires 8x H100 for the actor + ref model allocation (assuming colocation), 8 more for the critic, and 16 more for the vLLM actors. So a total of 32x H100s seems to be needed for this run. Is this correct?
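A quick back-of-the-envelope check of that count, under the assumptions stated above (actor and ref colocated, separate critic and vLLM pools; not verified against the repo's script):

```python
# Hypothetical GPU budget implied by the default multi-node script,
# under the assumptions stated above.
actor_ref_gpus = 8   # actor + reference model, colocated
critic_gpus = 8      # critic
vllm_gpus = 16       # vLLM rollout engines
print(actor_ref_gpus + critic_gpus + vllm_gpus)  # 32
```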