
Please give a repro on ray cluster setup? #9

Open
kouroshHakha opened this issue Jan 26, 2025 · 4 comments

Comments

@kouroshHakha

I read the code, and it seems the current setup uses STRICT_SPREAD placement, which forces the processes to be placed on different nodes (not something that works with one or two H100 boxes).

Also, after changing STRICT_SPREAD to SPREAD, I think the default provided shell script requires 8x H100 for the actor + ref model allocation (assuming colocation), 8 more for the critic, and 16 more for the vLLM actors. So a total of 32x H100s seems to be needed for this run. Is this correct?
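
For reference, this is roughly what the two strategies mean in Ray's placement-group API (a minimal sketch using Ray's public API; the 1-GPU bundle shapes are illustrative, not the bundles OpenRLHF actually builds):

import ray
from ray.util.placement_group import placement_group

ray.init()

# STRICT_SPREAD requires every bundle to land on a different node,
# so four 1-GPU bundles can never be scheduled on a single machine.
strict_pg = placement_group([{"GPU": 1}] * 4, strategy="STRICT_SPREAD")

# SPREAD only prefers distinct nodes; the same four bundles will pack
# onto one node with 4+ GPUs if that is all the cluster offers.
spread_pg = placement_group([{"GPU": 1}] * 4, strategy="SPREAD")

ray.get(spread_pg.ready())  # resolves once all bundles are placed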

@HYZ17
Collaborator

HYZ17 commented Jan 27, 2025

Hi, thank you for your question! We utilize 4 nodes to accelerate our experiments. You can adjust the allocation based on your needs by modifying the ref/actor/critic_num_gpus_per_node and vllm_num_engines settings. For instance, you can refer to this single-node configuration from the official OpenRLHF repository. As for the minimum requirements, I believe it would be 8x A100 or H100 GPUs.

@ctjlewis

@HYZ17 4 nodes, each with 8x 80GB GPUs? Did you use A100s or H100s? Any details on the hardware would be appreciated for reproducing this.

@HYZ17
Collaborator

HYZ17 commented Jan 27, 2025

We use 4 nodes, each with 8x H100-80G GPUs, to train on 8K MATH examples for 2 days. The minimum hardware requirement is 6x H100/A100-80G GPUs (we haven't tested this yet).

For more details, please refer to here. Thank you!

@kouroshHakha
Author

kouroshHakha commented Jan 27, 2025

I ended up using the following setup for 1 node on Qwen-1.5B with ZeRO-3 for training. It's slow, but it runs. I also had to change all the STRICT_SPREAD strategies to SPREAD (though I don't think this part is necessary, since num_gpus_per_node is 1 for the distributed actors).

python3 openrlhf/cli/train_ppo_ray_box.py \
    --ref_num_nodes 1 \
    --ref_num_gpus_per_node 1 \
    --reward_num_nodes 0 \
    --reward_num_gpus_per_node 0 \
    --critic_num_nodes 1 \
    --critic_num_gpus_per_node 1 \
    --actor_num_nodes 1 \
    --actor_num_gpus_per_node 1 \
    --vllm_num_engines 4 \
    --vllm_tensor_parallel_size 1 \
    --colocate_actor_ref \
    --pretrain Qwen/Qwen2.5-Math-1.5B \
    ...
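
For anyone making the same change: it amounts to swapping the strategy string wherever the trainer builds its placement groups. A rough sketch of what that looks like (the bundle construction and variable names below are illustrative, not the exact OpenRLHF source):

from ray.util.placement_group import placement_group

num_total_gpus = 8  # hypothetical: one 8-GPU node

bundles = [{"GPU": 1, "CPU": 1} for _ in range(num_total_gpus)]

# Before: every bundle must sit on a separate node, which can never
# be satisfied on a single box and leaves scheduling stuck.
# pg = placement_group(bundles, strategy="STRICT_SPREAD")

# After: bundles may share a node when the cluster is small.
pg = placement_group(bundles, strategy="SPREAD")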
