Smooth Gradients, Stable Learning: Logits Convexity for Reinforcement Learning

14 Sept 2025 (modified: 17 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: LLM, Reinforcement Learning
TL;DR: We identify logits convexity, a property that smooths gradient magnitudes during optimization and enhances RL stability.
Abstract: Reinforcement learning (RL) has been pivotal to the recent success of large language models (LLMs) across a broad spectrum of tasks. However, RL optimization often suffers from inherent stability challenges, particularly when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective. We identify a property of the softmax cross-entropy loss used in SFT, which we term logits convexity, characterized by local convexity with respect to the logits. Our theoretical analysis shows that logits convexity induces smoother gradient magnitudes during optimization, thereby enhancing stability. In contrast, the policy-gradient objectives of widely used algorithms such as PPO and GRPO lack this property. Motivated by this insight, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization strategy that aligns the policy distribution with a carefully designed target distribution via a KL divergence, emulating the stabilizing effect of logits convexity. Empirical results demonstrate that LCO improves stability and consistently outperforms conventional RL methods on both reasoning and non-reasoning benchmarks. Code and datasets will be made publicly available.
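
A brief sketch of the gradient contrast the abstract alludes to (not taken from the submission; the notation $z$ for logits, $y$ for the one-hot target, and $\hat{A}_t$ for the advantage estimate is our own). For the softmax cross-entropy loss used in SFT,

\[ \frac{\partial \mathcal{L}_{\mathrm{CE}}}{\partial z_i} = \mathrm{softmax}(z)_i - y_i \in (-1, 1), \]

so per-token gradient magnitudes with respect to the logits are uniformly bounded. For a REINFORCE-style policy-gradient term (the unclipped core of PPO/GRPO),

\[ \frac{\partial}{\partial z_i}\Big(-\hat{A}_t \log \pi_\theta(a_t \mid s_t)\Big) = \hat{A}_t\big(\pi_\theta(i \mid s_t) - \mathbf{1}[i = a_t]\big), \]

whose magnitude scales with $|\hat{A}_t|$ and is therefore not uniformly bounded, consistent with the stability gap described above.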
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 5196