Keywords: LLM, RL
Abstract: To our knowledge, all existing reinforcement fine-tuning algorithms for large language models require generating a complete reasoning process starting from the question, which incurs substantial time overhead during the rollout phase of training. Challenging this convention, we hypothesize that during reinforcement fine-tuning the model only needs to generate part of the reasoning process. We analyze how different segments of the reasoning path affect the correctness of the final result, and based on these insights we introduce \textbf{Policy Optimization with Experience Replay (POER)}, a plug-and-play reinforcement fine-tuning algorithm. Unlike traditional reinforcement fine-tuning algorithms that generate full reasoning paths, POER trains the model by generating only suffixes of the reasoning path on top of an experience cache, significantly reducing training time while improving training stability. Measured during the rollout phase of training, POER cuts token generation in this phase by approximately 95\%, greatly lowering the theoretical time overhead. In practice, compared with full-path reinforcement fine-tuning algorithms, POER reduces training time by 90\% for a 1.5B model and by 72\% for a 7B model, while maintaining performance comparable to typical algorithms such as GRPO and DAPO.
We have open-sourced the code in an anonymous repository: \url{https://anonymous.4open.science/r/POER-4BF2}
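For intuition, below is a minimal sketch of the suffix-only rollout idea described in the abstract, assuming a simple in-memory experience cache keyed by question. The names (ExperienceCache, generate_tokens, rollout) and the keep_ratio parameter are hypothetical illustrations of the general technique, not the released POER implementation; prefix selection, reward computation, and the policy update are omitted (see the repository above for the actual code).

```python
# Sketch: suffix-only rollout with an experience cache (illustrative, not POER's code).
import random
from dataclasses import dataclass, field


@dataclass
class ExperienceCache:
    """Caches previously generated reasoning traces, keyed by question."""
    traces: dict = field(default_factory=dict)  # question -> list of token ids

    def put(self, question: str, trace: list[int]) -> None:
        self.traces[question] = trace

    def sample_prefix(self, question: str, keep_ratio: float) -> list[int] | None:
        """Return the first `keep_ratio` fraction of a cached trace, if any."""
        trace = self.traces.get(question)
        if trace is None:
            return None
        cut = max(1, int(len(trace) * keep_ratio))
        return trace[:cut]


def generate_tokens(prompt_tokens: list[int], max_new: int) -> list[int]:
    """Stand-in for the policy model's decoding loop (random tokens here)."""
    return [random.randint(0, 31_999) for _ in range(max_new)]


def rollout(question: str, question_tokens: list[int],
            cache: ExperienceCache, keep_ratio: float = 0.95) -> list[int]:
    """Full-path rollout on a cache miss; suffix-only rollout on a cache hit."""
    prefix = cache.sample_prefix(question, keep_ratio)
    if prefix is None:
        # Cold start: generate the complete reasoning path once and cache it.
        trace = generate_tokens(question_tokens, max_new=1024)
        cache.put(question, trace)
        return trace
    # Warm start: reuse the cached prefix and generate only the remaining suffix,
    # so far fewer tokens are decoded during rollout.
    suffix = generate_tokens(question_tokens + prefix, max_new=64)
    return prefix + suffix


if __name__ == "__main__":
    cache = ExperienceCache()
    q = "What is 17 * 24?"
    toks = list(range(8))             # pretend-tokenized question
    first = rollout(q, toks, cache)   # full path, populates the cache
    second = rollout(q, toks, cache)  # only a short suffix is generated
    print(len(first), len(second))
```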
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 12091