On the Rollout-Training Mismatch in Modern RL Systems

Published: 16 Oct 2025, Last Modified: 10 Nov 2025 · NeurIPS 2025 ER Workshop Spotlight · CC BY 4.0
Keywords: Reinforcement Learning
TL;DR: We fix the rollout-training mismatch problem in modern RL systems.
Abstract: Modern reinforcement learning (RL) systems pursue efficiency by adopting hybrid engines for rollout generation (e.g., vLLM) and model training (e.g., FSDP). Such an implementation, while efficient, introduces a subtle rollout-training mismatch: even with the same model weights and architecture, the two backends can produce significantly different token probabilities, implicitly turning on-policy RL into off-policy RL. We address this problem with truncated importance sampling (TIS), a simple yet effective correction that bridges the distribution gap and stabilizes training, even under aggressive rollout quantization. Extensive experiments show that TIS improves training quality in standard BF16 RL with hybrid engines and, when 8-bit rollouts are used for speedup, preserves downstream performance relative to BF16 rollouts. Our work establishes an algorithmic foundation for efficient and effective RL training, advancing scalable reasoning systems.
Submission Number: 156
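For readers unfamiliar with the correction, the sketch below illustrates the general idea of a truncated importance-sampling weight applied to a token-level policy-gradient loss: the ratio between the training backend's and the rollout backend's token probabilities re-weights each token, and truncating the ratio bounds its variance. This is a minimal sketch under our own assumptions; the function name, argument layout, and the threshold clip_c are illustrative and not the paper's exact implementation.

```python
import torch

def tis_policy_gradient_loss(
    train_logprobs: torch.Tensor,    # log prob of each token under the training backend (e.g., FSDP)
    rollout_logprobs: torch.Tensor,  # log prob of each token recorded by the rollout backend (e.g., vLLM)
    advantages: torch.Tensor,        # per-token advantage estimates
    mask: torch.Tensor,              # 1.0 for response tokens, 0.0 for prompt/padding
    clip_c: float = 2.0,             # truncation threshold C (illustrative default)
) -> torch.Tensor:
    """Token-level policy-gradient loss with a truncated importance-sampling correction."""
    # Importance ratio pi_train / pi_rollout, computed in log space for numerical stability.
    # Detached so the weight acts as a constant coefficient on the gradient.
    ratio = torch.exp(train_logprobs.detach() - rollout_logprobs)
    # Truncate the ratio from above to keep the correction low-variance.
    tis_weight = torch.clamp(ratio, max=clip_c)
    # REINFORCE-style surrogate, re-weighted token by token to account for the
    # rollout-training mismatch between the two backends.
    pg_loss = -(tis_weight * advantages * train_logprobs)
    return (pg_loss * mask).sum() / mask.sum().clamp(min=1)
```

Without the weight, gradients are taken as if the samples came from the training backend's distribution even though they were generated by the rollout backend, which is exactly the implicit off-policy drift described in the abstract; truncation trades a small bias for much lower variance than raw importance sampling.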