It’s Not You, It’s Clipping: A Soft Trust-Region via Probability Smoothing for LLM RL

20 Sept 2025 (modified: 17 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Policy Optimization, PPO, GRPO, Clipping, Trust Region, Probability Smoothing, Soft Trust Region, LLM, Reasoning, Mathematical Problem Solving, Fine-Tuning
Abstract: Training large language models (LLMs) with reinforcement learning (RL) methods such as PPO and GRPO commonly relies on ratio clipping to stabilise updates. While effective at preventing instability, clipping discards information and introduces gradient discontinuities. We propose Probability Smoothing Policy Optimisation (PSPO), which smooths the current policy's probabilities toward the old (behaviour) policy before computing the importance ratio, analogous to label smoothing. Unlike clipping, PSPO preserves the gradient signal, while interpolation toward the old policy creates a soft trust region that discourages large, destabilising updates, with formal guarantees. We instantiate PSPO within GRPO (GR-PSPO) and fine-tune Qwen2.5-0.5B/1.5B on GSM8K, evaluating on the GSM8K test set and on cross-dataset generalisation to SVAMP, ASDiv, and MATH-500. Relative to unclipped GRPO (single iteration; no data reuse, so the ratio is always 1), GR-PSPO attains similar accuracy but produces clearer, more concise, and more logically coherent responses (LLM-as-Judge). Compared to clipped GRPO, GR-PSPO substantially improves performance for both the 0.5B and 1.5B models, with an absolute gain of over 20 points on GSM8K (39.7% vs. 17.6% for 0.5B, 59.4% vs. 37.8% for 1.5B).
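The abstract does not give the update rule itself, but the mechanism it describes (interpolating the current policy's probabilities toward the behaviour policy before forming the importance ratio) can be sketched as below. This is an illustrative sketch only: the function name `smoothed_ratio`, the coefficient `alpha`, and the plain unclipped surrogate are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def smoothed_ratio(logp_new: torch.Tensor,
                   logp_old: torch.Tensor,
                   alpha: float = 0.1) -> torch.Tensor:
    """Interpolate the current policy's token probabilities toward the old
    (behaviour) policy, then form the importance ratio.

    `alpha` is a hypothetical smoothing coefficient (0 = plain ratio,
    1 = ratio pinned at 1); the paper's actual value or schedule is not
    stated in the abstract.
    """
    p_new = logp_new.exp()
    p_old = logp_old.exp()
    p_smooth = (1.0 - alpha) * p_new + alpha * p_old  # smooth toward the old policy
    return p_smooth / p_old  # used in place of the clipped ratio

# Illustrative GRPO-style surrogate without clipping (toy tensors):
logp_new = torch.tensor([-1.2, -0.3, -2.0], requires_grad=True)
logp_old = torch.tensor([-1.0, -0.5, -1.5])
advantages = torch.tensor([0.7, -0.2, 1.1])  # e.g. group-normalised advantages (assumed)
loss = -(smoothed_ratio(logp_new, logp_old) * advantages).mean()
loss.backward()
```

Algebraically the smoothed ratio equals (1 − α)·(π_new/π_old) + α, so it stays bounded away from zero while its gradient with respect to the current policy never vanishes, unlike a clipped ratio; this is one concrete reading of the "soft trust region" behaviour described above.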
Primary Area: reinforcement learning
Submission Number: 24781