veRL is a flexible, efficient and production-ready RL training framework designed for large language models (LLMs).
veRL is the open-source implementation of the paper HybridFlow: A Flexible and Efficient RLHF Framework.
veRL is flexible and easy to use with:
- Easy extension of diverse RL algorithms: The hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex post-training dataflows, allowing users to build RL dataflows in a few lines of code.
- Seamless integration of existing LLM infra with modular APIs: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks such as PyTorch FSDP, Megatron-LM and vLLM. Moreover, users can easily extend to other LLM training and inference frameworks.
- Flexible device mapping: Supports various placements of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes.
- Ready integration with popular HuggingFace models
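As a rough illustration of the single-controller idea behind the hybrid programming model, the sketch below shows one controller function describing the dataflow while each worker group handles its own sharded computation. It is plain Python and does not use veRL's API: `WorkerGroup`, `rl_step`, and the toy actor and reward functions are all hypothetical names chosen for this example.

```python
"""Toy single-controller dataflow over multi-worker groups.

Not veRL's API: every class and function name here is hypothetical.
"""
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class WorkerGroup:
    """Stands in for a group of processes that run one model role."""

    name: str
    num_workers: int
    fn: Callable[[str], str]  # per-sample computation each worker applies

    def run(self, batch: List[str]) -> List[str]:
        # Shard the batch round-robin across workers.
        shards = [batch[i::self.num_workers] for i in range(self.num_workers)]
        # Each worker processes its own shard (sequentially here, for simplicity).
        outputs = [[self.fn(x) for x in shard] for shard in shards]
        # Gather results back into the original sample order.
        merged = [""] * len(batch)
        for w, shard_out in enumerate(outputs):
            for j, y in enumerate(shard_out):
                merged[w + j * self.num_workers] = y
        return merged


def rl_step(
    prompts: List[str], actor: WorkerGroup, reward: WorkerGroup
) -> List[Tuple[str, str, str]]:
    """Controller: the whole dataflow is written as a few sequential calls."""
    responses = actor.run(prompts)   # generation phase
    scores = reward.run(responses)   # scoring phase
    return list(zip(prompts, responses, scores))


# Toy stand-ins for an actor model and a reward model.
actor = WorkerGroup("actor", num_workers=2, fn=lambda p: p + " -> response")
reward = WorkerGroup("reward", num_workers=1, fn=lambda r: str(len(r)))

result = rl_step(["a", "bb", "ccc"], actor, reward)
```

The controller never deals with sharding or gathering; changing `num_workers` on either group changes how the work is placed without touching `rl_step`.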
veRL is fast with:
- State-of-the-art throughput: By seamlessly integrating existing SOTA LLM training and inference frameworks, veRL achieves high generation and training throughput.
- Efficient actor model resharding with 3D-HybridEngine: Eliminates memory redundancy and significantly reduces communication overhead during transitions between training and generation phases.
| Documentation | Paper | Slack | Wechat |
- [2024/12] The team presented Post-training LLMs: From Algorithms to Infrastructure at NeurIPS 2024. Slides and video available.
- [2024/10] veRL was presented at Ray Summit. YouTube video available.
- [2024/08] HybridFlow (verl) was accepted to EuroSys 2025.
- FSDP and Megatron-LM for training.
- vLLM and TGI for rollout generation; SGLang support is coming soon.
- HuggingFace model support
- Supervised fine-tuning
- Reinforcement learning from human feedback with PPO and GRPO
- Support for model-based rewards and function-based (verifiable) rewards
- Flash-attention integration, sequence packing, and long context support via DeepSpeed Ulysses
- Scales up to 70B models and hundreds of GPUs
- Experiment tracking with wandb and mlflow
- Reward model training
- DPO training
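To make the function-based (verifiable) reward and GRPO items above more concrete, here is a minimal sketch, not veRL's reward interface. It scores responses with a simple rule and then standardizes the rewards within one prompt's group of samples, which is the core idea of GRPO advantages. The function names, the "last number in the response" rule, and the use of the population standard deviation are assumptions made for this example.

```python
"""Sketch: rule-based reward plus group-normalized (GRPO-style) advantages.

All names are hypothetical; this is not veRL's reward interface.
"""
import re
import statistics
from typing import List


def exact_match_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the last number in the response equals the ground truth."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0  # no number found, so nothing to verify
    return 1.0 if numbers[-1] == ground_truth else 0.0


def group_normalized_advantages(
    rewards: List[float], eps: float = 1e-6
) -> List[float]:
    """Standardize rewards within one prompt's group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption here
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to the same prompt, whose reference answer is "42".
responses = [
    "The answer is 42",
    "I think it is 41",
    "So we get 42",
    "no idea",
]
rewards = [exact_match_reward(r, "42") for r in responses]
advantages = group_normalized_advantages(rewards)
```

Correct responses end up with positive advantages and incorrect ones with negative advantages, without needing a learned value model for the baseline.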
Check out this Jupyter Notebook to get started with PPO training on a single 24GB L4 GPU (FREE GPU quota provided by Lightning Studio)!
Quickstart:
Running a PPO example step-by-step:
- Data and Reward Preparation
- Understanding the PPO Example
Reproducible algorithm baselines:
For code explanation and advanced usage (extension):
- PPO Trainer and Workers
- Advance Usage and Extension
If you find the project helpful, please cite:
- HybridFlow: A Flexible and Efficient RLHF Framework
- A Framework for Training Large Language Models for Code Generation via Proximal Policy Optimization
@article{sheng2024hybridflow,
title = {HybridFlow: A Flexible and Efficient RLHF Framework},
author = {Guangming Sheng and Chi Zhang and Zilingfeng Ye and Xibin Wu and Wang Zhang and Ru Zhang and Yanghua Peng and Haibin Lin and Chuan Wu},
year = {2024},
journal = {arXiv preprint arXiv: 2409.19256}
}
verl is inspired by the design of NeMo-Aligner, DeepSpeed-Chat and OpenRLHF. The project is adopted and supported by Anyscale, ByteDance, LMSys.org, Shanghai AI Lab, Tsinghua University, UC Berkeley, UCLA, UIUC, and University of Hong Kong.
- Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization
- Flaming-hot Initiation with Regular Execution Sampling for Large Language Models
- Process Reinforcement Through Implicit Rewards
We are HIRING! Send us an email if you are interested in internship/FTE opportunities in MLSys/LLM reasoning/multimodal alignment.