<h1 style="text-align: center;">veRL: Volcano Engine Reinforcement Learning for LLM</h1>

veRL is a flexible, efficient and production-ready RL training framework designed for large language models (LLMs). 

veRL is the open-source implementation of the **[HybridFlow: A Flexible and Efficient RLHF Framework](XXXX)** paper.

veRL is flexible and easy to use with:

- **Easy extension of diverse RL algorithms**: The hybrid programming model combines the strengths of the single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex post-training dataflows, allowing users to build RL dataflows in a few lines of code.

- **Seamless integration of existing LLM infra with modular APIs**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks such as PyTorch FSDP, Megatron-LM, and vLLM. Users can also easily extend veRL to other LLM training and inference frameworks.

- **Flexible device mapping**: Supports various placements of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes.

- Ready integration with popular Hugging Face models
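
As a rough illustration of the single-controller idea and of device mapping described above, the sketch below expresses one PPO-style iteration as a few sequential calls on worker groups, each assigned to its own set of GPUs. Everything in it (`WorkerGroup`, `ppo_step`, the toy stage functions, and the GPU ids) is hypothetical plain Python written for this README and is not the veRL API; the documentation covers the real interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Batch = Dict[str, list]


@dataclass
class WorkerGroup:
    """Hypothetical stand-in for a group of workers placed on a set of GPUs."""
    name: str
    gpu_ids: List[int]
    fn: Callable[[Batch], Batch]

    def run(self, batch: Batch) -> Batch:
        # A real worker group would execute fn in parallel across gpu_ids.
        return self.fn(batch)


# Toy stage functions (assumptions for illustration only).
def generate(batch: Batch) -> Batch:
    return {**batch, "responses": [p + " -> 4" for p in batch["prompts"]]}


def score(batch: Batch) -> Batch:
    rewards = [1.0 if r.endswith("4") else 0.0 for r in batch["responses"]]
    return {**batch, "rewards": rewards}


def estimate_values(batch: Batch) -> Batch:
    return {**batch, "values": [0.5 for _ in batch["responses"]]}


def update_policy(batch: Batch) -> Batch:
    advantages = [r - v for r, v in zip(batch["rewards"], batch["values"])]
    return {**batch, "advantages": advantages}


def ppo_step(groups: Dict[str, WorkerGroup], batch: Batch) -> Batch:
    """Single-controller view: the dataflow reads as a short sequential program."""
    for stage in ("rollout", "reward", "critic", "actor"):
        batch = groups[stage].run(batch)
    return batch


# Device mapping: actor and rollout share GPUs 0-1; critic and reward get their own.
groups = {
    "rollout": WorkerGroup("rollout", [0, 1], generate),
    "actor": WorkerGroup("actor", [0, 1], update_policy),
    "critic": WorkerGroup("critic", [2], estimate_values),
    "reward": WorkerGroup("reward", [3], score),
}
result = ppo_step(groups, {"prompts": ["2 + 2 = ?", "1 + 3 = ?"]})
```

Changing the placement in this sketch only means editing the `gpu_ids` passed to each group; the controller function `ppo_step` stays the same, which is the separation the hybrid programming model aims for.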


veRL is fast with:

- **State-of-the-art throughput**: By seamlessly integrating existing SOTA LLM training and inference frameworks, veRL achieves high generation and training throughput.

- **Efficient actor model resharding with 3D-HybridEngine**: Eliminates memory redundancy and significantly reduces communication overhead during transitions between training and generation phases.

<p align="center">
| <a href="XXXX"><b>Documentation</b></a> | <a href="XXXX"><b>Paper</b></a> | <a href="XXXX"><b>Slack</b></a> | <a href="XXXX"><b>WeChat</b></a> |

<!-- <a href=""><b>Slides</b></a> | -->
</p>

## News

- [2024/12] The team presented <a href="XXXX">Post-training LLMs: From Algorithms to Infrastructure</a> at NeurIPS 2024. [Slides](XXXX) and [video](XXXX) available.
- [2024/10] veRL was presented at Ray Summit. [YouTube video](XXXX) available.
- [2024/08] HybridFlow (veRL) was accepted to EuroSys 2025.

## Key Features

- **FSDP** and **Megatron-LM** for training.
- **vLLM** and **TGI** for rollout generation; **SGLang** support is coming soon.
- Hugging Face model support
- Supervised fine-tuning
- Reinforcement learning from human feedback with [PPO](XXXX) and [GRPO](XXXX)
  - Supports model-based rewards and function-based rewards (verifiable rewards)
- Flash-attention integration, sequence packing, and long-context support via DeepSpeed Ulysses
- Scales up to 70B models and hundreds of GPUs
- Experiment tracking with wandb and mlflow

## Upcoming Features
- Reward model training
- DPO training

## Getting Started

Check out this [Jupyter Notebook](XXXX) to get started with PPO training on a single 24GB L4 GPU (**FREE** GPU quota provided by [Lightning Studio](XXXX))!

**Quickstart:**
- [Installation](XXXX)
- [Quickstart](XXXX)

**Running a PPO example step-by-step:**
- Data and Reward Preparation
  - [Prepare Data (Parquet) for Post-Training](XXXX)
  - [Implement Reward Function for Dataset](XXXX)
- Understanding the PPO Example
  - [PPO Example Architecture](XXXX)
  - [Config Explanation](XXXX)
  - [Run GSM8K Example](XXXX)

**Reproducible algorithm baselines:**
- [PPO](XXXX)

**For code explanation and advanced usage (extension):**
- PPO Trainer and Workers
  - [PPO Ray Trainer](XXXX)
  - [PyTorch FSDP Backend](XXXX)
  - [Megatron-LM Backend](XXXX)
- Advanced Usage and Extension
  - [Ray API Design Tutorial](XXXX)
  - [Extend to other RL(HF) algorithms](XXXX)
  - [Add models with the FSDP backend](XXXX)
  - [Add models with the Megatron-LM backend](XXXX)


## Citation and Acknowledgement

If you find the project helpful, please cite:
- [HybridFlow: A Flexible and Efficient RLHF Framework](XXXX)
- [A Framework for Training Large Language Models for Code Generation via Proximal Policy Optimization](XXXX)

```tex
@article{sheng2024hybridflow,
  title   = {HybridFlow: A Flexible and Efficient RLHF Framework},
  author  = {Guangming Sheng and Chi Zhang and Zilingfeng Ye and Xibin Wu and Wang Zhang and Ru Zhang and Yanghua Peng and Haibin Lin and Chuan Wu},
  year    = {2024},
  journal = {arXiv preprint arXiv:2409.19256}
}
```

veRL is inspired by the design of NeMo-Aligner, DeepSpeed-Chat, and OpenRLHF. The project is adopted and supported by Anyscale, ByteDance, LMSys.org, Shanghai AI Lab, Tsinghua University, UC Berkeley, UCLA, UIUC, and University of Hong Kong.

## Publications Using veRL
- [Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization](XXXX)
- [Flaming-hot Initiation with Regular Execution Sampling for Large Language Models](XXXX)
- [Process Reinforcement Through Implicit Rewards](XXXX)

We are HIRING! Send us an [email](mailto:haibin.lin@bytedance.com) if you are interested in internship/FTE opportunities in MLSys/LLM reasoning/multimodal alignment.
