Keywords: Reinforcement Learning
TL;DR: We propose to use truncated importance sampling (TIS) to fix rollout–training mismatch in modern RL systems, restoring stability and accuracy while enabling efficient quantized rollouts.
Abstract: Modern reinforcement learning (RL) systems aim to be efficient by employing hybrid designs for rollout generation (e.g., vLLM) and model training (e.g., FSDP). However, this implementation gap can implicitly turn on-policy RL into off-policy RL, as the rollout and training policies can assign significantly different token probabilities despite sharing the same model weights. We dive into this rollout–training mismatch problem and propose truncated importance sampling (TIS) as a simple yet effective fix. TIS applies an importance sampling correction to bridge the distribution gap between rollout and training, enabling stable RL training even with quantized rollouts. We demonstrate TIS's effectiveness across multiple settings, showing that it preserves downstream performance while enabling significant speedups through rollout quantization. Our work provides an algorithmic solution to the systematic mismatch problem in efficient RL training.
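For intuition, a minimal sketch of the kind of correction the abstract describes is given below: a token-level importance ratio between the training and rollout policies, truncated at a cap, weighting a policy-gradient surrogate. The function name, the cap value, and the exact placement of the truncation are illustrative assumptions, not the paper's precise formulation.

```python
import torch

def tis_policy_gradient_loss(logp_train, logp_rollout, advantages, cap=2.0):
    """Policy-gradient loss with a truncated importance sampling (TIS) correction.

    logp_train:   log-probs of sampled tokens under the training policy (requires grad)
    logp_rollout: log-probs of the same tokens recorded by the rollout engine (no grad)
    advantages:   per-token advantage estimates
    cap:          truncation threshold C for the importance ratio
    """
    # Importance ratio w = pi_train / pi_rollout, computed from log-probabilities.
    ratio = torch.exp(logp_train - logp_rollout.detach())
    # Truncate from above to bound variance; the weight is detached so it acts
    # as a constant correction factor rather than a gradient path.
    weight = torch.clamp(ratio, max=cap).detach()
    # REINFORCE-style surrogate: weighted advantage times token log-likelihood.
    return -(weight * advantages.detach() * logp_train).mean()
```

In a hybrid system, `logp_rollout` would come from the inference engine (possibly quantized) and `logp_train` from the training framework, so the weight directly measures and corrects the rollout–training mismatch.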
Submission Number: 116