Keywords: llm, importance sampling
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a key technique for aligning Large Language Models (LLMs) with human preferences. While Proximal Policy Optimization (PPO) is the standard algorithm, its reliance on a critic network incurs significant memory and computational costs. This has motivated critic-free alternatives such as Group Relative Policy Optimization (GRPO) and Group Sequence Policy Optimization (GSPO). However, these methods suffer from a critical trade-off: they either employ theoretically unsound, high-variance estimators (GRPO) or introduce systematic bias to achieve stability, causing them to optimize a perturbed objective (GSPO). In this paper, we introduce SNIB (Self-Normalized Importance Sampling with a Baseline), a novel critic-free algorithm that resolves this dilemma: it leverages principled self-normalized importance sampling to achieve the stability of modern methods without sacrificing asymptotic correctness. We provide a comprehensive theoretical analysis, proving that SNIB's gradient estimator is consistent and asymptotically unbiased. Furthermore, we demonstrate its superior robustness to reward model uncertainty and show that it preserves the principled trade-off between reward maximization and KL regularization, a property that is distorted by biased estimators. Our work establishes a theoretically grounded foundation for building more stable and reliable critic-free RLHF algorithms.
Supplementary Material: pdf
Primary Area: reinforcement learning
Submission Number: 24114
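
To make the estimator described in the abstract concrete, the sketch below illustrates generic self-normalized importance sampling with a baseline over a group of sampled sequences. It is a minimal PyTorch illustration under stated assumptions, not the paper's exact SNIB objective (which, per the abstract, also involves KL regularization); the function name `snis_baseline_surrogate`, the tensor shapes, and the toy usage are illustrative assumptions.

```python
import torch

def snis_baseline_surrogate(logp_new, logp_old, rewards):
    """Self-normalized importance-sampling surrogate with a baseline (illustrative sketch).

    logp_new: (G,) sequence log-probs under the current policy (requires grad)
    logp_old: (G,) sequence log-probs under the sampling policy (treated as constant)
    rewards:  (G,) scalar rewards for the G sampled sequences
    """
    # Importance ratios w_i = pi_new(y_i | x) / pi_old(y_i | x), kept in log-space.
    log_w = logp_new - logp_old.detach()
    # Self-normalized weights w_i / sum_j w_j; softmax over log-ratios is numerically stable.
    w_tilde = torch.softmax(log_w, dim=0)
    # Baseline: the self-normalized estimate of the mean reward, held constant for the gradient.
    baseline = (w_tilde.detach() * rewards).sum()
    # Surrogate: self-normalized estimate of the centered reward; its gradient is
    # consistent (the finite-sample bias vanishes) as the group size G grows.
    return (w_tilde * (rewards - baseline)).sum()

# Toy usage with a group of G = 8 sampled sequences (hypothetical values).
G = 8
logp_old = torch.randn(G)
logp_new = (logp_old + 0.1 * torch.randn(G)).detach().requires_grad_(True)
rewards = torch.randn(G)
loss = -snis_baseline_surrogate(logp_new, logp_old, rewards)  # minimize the negative surrogate
loss.backward()
```

Because the normalized weights sum to one, no single importance ratio can dominate the update, which is the usual source of stability in self-normalized estimators; the normalization introduces only a finite-sample bias that vanishes as the group size grows, consistent with the abstract's claim of asymptotic correctness.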