Keywords: LLM, Preference Learning, Importance Sampling, Online RLHF
Abstract: Training large language models (LLMs) on online sampled data can help off-policy preference optimization approaches such as DPO learn better. Recent methods such as Statistical Rejection Sampling Optimization (RSO) have emerged as attractive alternatives to online Reinforcement Learning from Human Feedback (RLHF), offering improvements in stability and scalability. Although RSO has shown promising results by using rejection sampling to obtain preference data from the estimated optimal target policy, it suffers from computational inefficiency due to the high rejection rates inherent in its sampling process. To address these limitations, we introduce **Importance Sampling Optimization** (ISO), a novel approach that achieves the benefits of sampling from the optimal policy distribution while significantly improving sample efficiency. ISO employs importance sampling to correct the distribution mismatch between the supervised fine-tuned (SFT) policy and the target optimal policy, enabling efficient use of all generated samples without rejection. Through extensive experiments across diverse tasks and models, we demonstrate that ISO achieves comparable or superior performance to RSO while requiring substantially fewer samples from the SFT policy, reducing sampling overhead by up to 75\% while maintaining or improving win rates against both DPO and RSO baselines. Additionally, we show that ISO naturally extends to other preference optimization methods, providing a general framework for improving sample efficiency in preference learning.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13020
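The abstract's core idea, reweighting SFT samples by the target-to-SFT density ratio instead of rejecting them, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the standard KL-regularized closed-form target $\pi^*(y|x) \propto \pi_{\text{SFT}}(y|x)\exp(r(x,y)/\beta)$ that RSO also targets, and the function name, rewards, and $\beta$ value below are hypothetical.

```python
import math

def self_normalized_importance_weights(rewards, beta):
    """Self-normalized importance weights for candidate responses sampled from pi_SFT.

    Assumes the KL-regularized optimal policy
        pi*(y|x) proportional to pi_SFT(y|x) * exp(r(x, y) / beta),
    so the importance ratio pi*/pi_SFT reduces to exp(r(x, y) / beta) up to a
    prompt-dependent constant, which cancels under self-normalization
    (i.e., the weights are a softmax of rewards / beta).
    """
    m = max(rewards)  # subtract the max reward for numerical stability
    raw = [math.exp((r - m) / beta) for r in rewards]
    total = sum(raw)
    return [w / total for w in raw]

# Example: keep and reweight all 4 candidates from pi_SFT rather than rejecting any.
rewards = [0.2, 1.5, -0.3, 0.8]  # hypothetical reward-model scores for one prompt
print(self_normalized_importance_weights(rewards, beta=0.5))
```

Under this reading, every generated sample contributes to the preference-data construction with a weight proportional to its estimated probability under the target policy, which is what allows the method to avoid the wasted generations of rejection sampling.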