Beyond the Stability-Exploration Dilemma: Environmental Regularization for LLM Policy Optimization

ICLR 2026 Conference Submission 16191 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Reinforcement Learning with Verifiable Rewards, Large Language Model, Math Reasoning
TL;DR: LLM policy optimization is unstable due to query distributional shift. We introduce Query-KL regularization to stabilize the query distribution, preventing training collapse and boosting performance on reasoning tasks.
Abstract: Policy optimization (PO) has advanced Large Language Models (LLMs), yet training remains constrained by a stability–exploration trade-off. We analyze the coupling between the input environment and the policy in LLM RL, and decouple regularization from the optimization objective by moving it from the parameter side to the input side. Concretely, we propose **Environment-Regularized Policy Optimization (ERPO)**, instantiated with **Query-KL (QKL)**, which penalizes the KL divergence between the evolving query distribution and a fixed reference. By regularizing the input (query) distribution rather than the action (response) distribution, QKL indirectly controls the policy drift induced by environmental shift while preserving exploration. To avoid premature convergence, we introduce a query-weighted advantage that reweights updates according to estimated query prevalence, reducing estimator variance and improving robustness. Across diverse mathematical reasoning benchmarks, ERPO achieves KL control comparable to methods with explicit policy regularization, while delivering stronger final performance and smoother training dynamics. Temperature-swept sampling further indicates more stable long-horizon behavior. These results suggest that making the input environment a first-class object, via QKL and the query-weighted advantage, is a principled and practical route to improving the stability–exploration trade-off in PO for LLMs.
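The abstract does not fix an implementation, but its two ingredients are concrete enough to sketch. Below is a minimal PyTorch illustration, assuming both the evolving query distribution and the fixed reference are categorical over a shared pool of candidate queries; the function names (`query_kl_penalty`, `query_weighted_advantage`), the coefficient `beta_qkl`, and the mean-1 normalization of the prevalence weights are all hypothetical choices for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F


def query_kl_penalty(query_logits: torch.Tensor,
                     ref_query_logits: torch.Tensor) -> torch.Tensor:
    """KL(q_t || q_ref) between the evolving query distribution q_t and a
    fixed reference q_ref, both given as logits over a shared query pool.
    How the query distribution is parameterized is an assumption here."""
    log_q = F.log_softmax(query_logits, dim=-1)
    log_ref = F.log_softmax(ref_query_logits, dim=-1)
    # KL(q || ref) = E_q[log q - log ref], averaged over the batch.
    return (log_q.exp() * (log_q - log_ref)).sum(dim=-1).mean()


def query_weighted_advantage(advantages: torch.Tensor,
                             query_prevalence: torch.Tensor,
                             eps: float = 1e-8) -> torch.Tensor:
    """Reweight per-sample advantages by estimated query prevalence,
    normalized so the mean weight over the batch is 1."""
    w = query_prevalence / (query_prevalence.mean() + eps)
    return w * advantages


if __name__ == "__main__":
    torch.manual_seed(0)
    batch, pool = 4, 16
    query_logits = torch.randn(batch, pool)   # evolving distribution q_t
    ref_logits = torch.randn(batch, pool)     # frozen reference q_ref
    advantages = torch.randn(batch)           # e.g., group-relative advantages
    prevalence = torch.rand(batch)            # estimated query prevalence
    logprobs = torch.randn(batch, requires_grad=True)  # stand-in for policy log-probs

    qkl = query_kl_penalty(query_logits, ref_logits)
    adv = query_weighted_advantage(advantages, prevalence)
    beta_qkl = 0.05  # hypothetical penalty coefficient
    # Standard policy-gradient surrogate plus the input-side QKL term.
    loss = -(adv.detach() * logprobs).mean() + beta_qkl * qkl
    print(f"QKL = {qkl.item():.4f}, loss = {loss.item():.4f}")
```

Normalizing the prevalence weights to mean 1 keeps the gradient scale comparable to the unweighted estimator; whether the paper normalizes this way, or estimates prevalence differently, is not specified in the abstract.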
Primary Area: reinforcement learning
Submission Number: 16191