Is there a limit on the largest PPO model size and max context length the package can support to scale out of the box? Is it scalable when we add more GPU nodes to the training job? Thanks!
-
Pasting response from @odelalleau below: NeMo-Aligner relies on Megatron-LM to support various parallelism schemes that allow scaling to very large models. We have yet to push it to its limits, so it is hard to give a reliable answer on max model size and max context length as of today. It is indeed scalable as we add more GPU nodes to the training job, but we have several optimizations in the works that we want to integrate before doing more extensive performance benchmarking.
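For intuition on why this scales with node count, here is a minimal sketch of how Megatron-LM-style parallelism degrees multiply into a total GPU count (plain Python arithmetic; the specific degrees are illustrative assumptions, not NeMo-Aligner defaults):

```python
# Minimal sketch of how Megatron-LM-style parallelism degrees compose.
# All numbers are illustrative assumptions, not NeMo-Aligner defaults.

tensor_parallel = 8    # shard each layer's weights across 8 GPUs (one node)
pipeline_parallel = 4  # split the layer stack into 4 sequential stages
data_parallel = 8      # replicate the (TP x PP) model group 8 times

gpus_per_replica = tensor_parallel * pipeline_parallel  # GPUs holding one model copy
world_size = gpus_per_replica * data_parallel           # GPUs in the whole job

print(f"GPUs per model replica: {gpus_per_replica}")  # 32
print(f"Total GPUs (world size): {world_size}")       # 256

# Adding nodes usually grows data_parallel: throughput scales roughly
# linearly while each replica's memory footprint stays fixed, which is
# why per-model size limits depend on TP/PP rather than on node count.
```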
-
Hi, we did PPO on a Llama-70B model with 4k context length. In terms of GPU count, we used 32x8 GPUs for the actor and 8x8 GPUs for the critic. Try our NV-Llama2-70B-RLHF model on NVIDIA AI Foundation for free.
Originally posted by @panyi121 in #70 (comment)
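To put concrete numbers on that split, a small sketch (assuming "NxM" above means N nodes of M GPUs each, which is an interpretation on my part, not confirmed in the thread):

```python
# Sketch of the actor/critic GPU split described above, assuming "32x8"
# means 32 nodes x 8 GPUs per node (an interpretation, not confirmed).

gpus_per_node = 8
actor_nodes, critic_nodes = 32, 8

actor_gpus = actor_nodes * gpus_per_node    # GPUs serving the PPO actor
critic_gpus = critic_nodes * gpus_per_node  # GPUs serving the PPO critic

print(f"Actor GPUs:  {actor_gpus}")                              # 256
print(f"Critic GPUs: {critic_gpus}")                             # 64
print(f"Actor:critic GPU ratio: {actor_gpus // critic_gpus}:1")  # 4:1

# The actor (policy) both generates rollouts and trains on them, while
# the critic only scores rollouts, which is one reason the actor would
# get the larger share of GPUs in a split like this 4:1 one.
```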