Sample-efficient LLM Optimization with Reset Replay

Zichuan Liu; Jinyu Wang; Lei Song; Jiang Bian

16 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Large Language Models, Sample Efficiency, Reinforcement Learning

TL;DR: This paper introduces LoRR, a plugin that boosts sample efficiency and mitigates overfitting in LLM preference optimization.

Abstract: Recent advances in post-training Large Language Models (LLMs), particularly through Reinforcement Learning (RL) and preference optimization methods, are key drivers of improved reasoning capabilities. However, these methods often suffer from low sample efficiency and from primacy bias, where overfitting to early experiences degrades policy quality and harms the subsequent learning process. To address these challenges, we introduce LLM optimization with Reset Replay (LoRR), a general plugin designed to improve sample efficiency in any preference-based optimization framework. LoRR's core mechanism enables training at a high replay number, maximizing the utility of each collected data batch. To counteract the overfitting risk inherent in high-replay training, LoRR incorporates a periodic reset strategy that reuses the initial data and preserves network plasticity. It also uses a hybrid optimization objective that combines Supervised Fine-Tuning (SFT) and preference-based losses to further improve data exploitation. Our extensive experiments show that LoRR significantly improves the performance of various preference optimization methods on both mathematical and general reasoning benchmarks. Notably, an iterative DPO approach augmented with LoRR achieves comparable performance on challenging math tasks and outperforms some complex, computationally intensive RL-based algorithms. These findings indicate that LoRR offers a practical and sample-efficient paradigm for learning from limited offline data, unlocking greater performance with minimal changes to existing post-training workflows.
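The three ingredients named in the abstract (a high replay number, periodic resets, and a hybrid SFT plus preference objective) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the loss weighting, the reset schedule, or what exactly is reset, so `beta`, `sft_weight`, `chosen_len`, `replay_number`, `reset_every`, and the function names below are hypothetical choices made for illustration. The preference term uses the standard DPO loss on sequence log-probabilities.

```python
import math


def _softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed sequence log-probabilities under the policy and a
    frozen reference model. `beta` is an assumed default.
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)) == softplus(-margin)
    return _softplus(-margin)


def hybrid_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
                chosen_len, beta=0.1, sft_weight=1.0):
    """Assumed hybrid objective: DPO loss plus a weighted SFT term.

    The SFT term is the length-normalized negative log-likelihood of the
    chosen response. The additive form and `sft_weight` are assumptions.
    """
    sft = -pol_chosen / chosen_len
    pref = dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta)
    return pref + sft_weight * sft


def replay_schedule(num_batches, replay_number, reset_every):
    """Yield (batch_index, replay_index, reset_now) for a training loop.

    Each collected batch is replayed `replay_number` times. `reset_now` is
    True on the first replay of every `reset_every`-th batch (except batch 0),
    marking where a reset step would be applied. What the reset does to the
    model is left to the caller, since the abstract does not specify it.
    """
    for b in range(num_batches):
        for r in range(replay_number):
            reset_now = (r == 0 and b > 0 and b % reset_every == 0)
            yield b, r, reset_now
```

In a real training loop, the log-probabilities would come from the policy and reference models, and the schedule would decide when each batch is reused and when the reset step runs. Separating the schedule from the loss keeps the sketch usable as a wrapper around an existing preference optimization loop, in the spirit of the plugin framing in the abstract.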

Supplementary Material: zip

Primary Area: optimization

Submission Number: 7896
