Keywords: LLM, Preference Learning, Importance Sampling, Online RLHF
Abstract: Training large language models (LLMs) on online sampled data can help off-policy preference optimization approaches such as DPO learn better. Recent methods such as Statistical Rejection Sampling Optimization (RSO) have emerged as attractive alternatives to online Reinforcement Learning from Human Feedback (RLHF), offering improvements in stability and scalability. Although RSO has shown promising results by using rejection sampling to obtain preference data from the estimated optimal target policy, it suffers from computational inefficiency due to the high rejection rates inherent in its sampling process. To address these limitations, we introduce **Importance Sampling Optimization** (ISO), a novel approach that achieves the benefits of sampling from the optimal policy distribution while significantly improving sample efficiency. ISO employs importance sampling to correct the distribution mismatch between the supervised fine-tuned (SFT) policy and the target optimal policy, enabling efficient use of all generated samples without rejection. Through extensive experiments across diverse tasks and models, we demonstrate that ISO achieves comparable or superior performance to RSO while requiring substantially fewer samples from the SFT policy, reducing sampling overhead by up to 75\% while maintaining or improving win rates against both DPO and RSO baselines. Additionally, we show that ISO naturally extends to other preference optimization methods, providing a general framework for improving sample efficiency in preference learning.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13020
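The abstract's core idea, reweighting SFT samples by the target-to-SFT density ratio instead of rejecting them, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the standard KL-regularized closed-form target $\pi^*(y|x) \propto \pi_{\text{SFT}}(y|x)\exp(r(x,y)/\beta)$ that RSO also targets, and the function name, rewards, and $\beta$ value below are hypothetical.

```python
import math

def self_normalized_importance_weights(rewards, beta):
    """Self-normalized importance weights for candidate responses sampled from pi_SFT.

    Assumes the KL-regularized optimal policy
        pi*(y|x) proportional to pi_SFT(y|x) * exp(r(x, y) / beta),
    so the importance ratio pi*/pi_SFT reduces to exp(r(x, y) / beta) up to a
    prompt-dependent constant, which cancels under self-normalization
    (i.e., the weights are a softmax of rewards / beta).
    """
    m = max(rewards)  # subtract the max reward for numerical stability
    raw = [math.exp((r - m) / beta) for r in rewards]
    total = sum(raw)
    return [w / total for w in raw]

# Example: keep and reweight all 4 candidates from pi_SFT rather than rejecting any.
rewards = [0.2, 1.5, -0.3, 0.8]  # hypothetical reward-model scores for one prompt
print(self_normalized_importance_weights(rewards, beta=0.5))
```

Under this reading, every generated sample contributes to the preference-data construction with a weight proportional to its estimated probability under the target policy, which is what allows the method to avoid the wasted generations of rejection sampling.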