Efficient RL Training for LLMs with Experience Replay

Published: 03 Mar 2026, Last Modified: 03 Mar 2026 · CC BY 4.0
Keywords: Experience Replay; Theory; Mid-scale Experiments (Qwen 7B)
TL;DR: Use a replay buffer to make RL post-training of LLMs more compute-efficient.
Abstract: While Experience Replay—the practice of storing rollouts and reusing them multiple times during training—is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity, and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading final model performance, in some cases even improving it, while preserving policy entropy.
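To make the design space concrete, here is a minimal Python sketch of the kind of replay buffer the abstract describes. All names (Rollout, ReplayBuffer, max_staleness, max_reuse) are illustrative assumptions rather than the paper's actual API; the staleness and reuse caps stand in for the staleness/diversity/compute trade-off the paper formalizes.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    response: str
    reward: float
    policy_version: int   # index of the policy update that generated this rollout
    reuse_count: int = 0  # how many times this rollout has been replayed


class ReplayBuffer:
    """FIFO buffer that evicts rollouts once they are too stale or over-replayed."""

    def __init__(self, capacity: int, max_staleness: int, max_reuse: int):
        self.buffer: deque[Rollout] = deque(maxlen=capacity)
        self.max_staleness = max_staleness  # evict after this many policy updates
        self.max_reuse = max_reuse          # evict after this many replays

    def add(self, rollout: Rollout) -> None:
        self.buffer.append(rollout)

    def sample(self, batch_size: int, current_version: int) -> list[Rollout]:
        # Drop rollouts whose staleness would inflate gradient variance,
        # or whose repeated reuse would hurt sample diversity.
        live = [
            r for r in self.buffer
            if current_version - r.policy_version <= self.max_staleness
            and r.reuse_count < self.max_reuse
        ]
        self.buffer = deque(live, maxlen=self.buffer.maxlen)
        batch = random.sample(live, min(batch_size, len(live)))
        for r in batch:
            r.reuse_count += 1
        return batch
```

A training loop built on such a buffer would interleave occasional generation with several gradient steps sampled from the buffer, so each expensive rollout amortizes over multiple updates: raising max_reuse cuts inference compute, while the staleness cap bounds how off-policy the replayed data can become.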
Submission Number: 28