Token-Regulated Group Relative Policy Optimization for Stable Reinforcement Learning in Large Language Models
Keywords: Large Language Models, Reinforcement Learning, Reasoning
TL;DR: We propose Token-Regulated GRPO, a reinforcement learning method that downweights low-probability tokens to stabilize training, consistently outperforming GRPO across logic, math, and agentic reasoning tasks.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful approach for strengthening the reasoning capabilities of large language models (LLMs). Among existing algorithms, Group Relative Policy Optimization (GRPO) has demonstrated strong performance, yet it suffers from a critical issue: low-probability tokens disproportionately dominate gradient updates due to their inherently large gradient magnitudes. This imbalance leads to unstable training and suppresses the contribution of high-probability tokens that are more reliable for learning. In this work, we introduce **Token-Regulated Group Relative Policy Optimization (TR-GRPO)**, a simple yet effective extension of GRPO that assigns token-level weights positively correlated with the model’s predicted probability. By downweighting low-probability tokens and emphasizing high-probability ones, TR-GRPO mitigates gradient over-amplification while preserving informative learning signals. We provide a theoretical analysis showing how token-level probability governs gradient norms, which motivates our weighting design. Extensive experiments demonstrate that TR-GRPO consistently outperforms GRPO across RLVR tasks—including logic, math, and agentic reasoning—highlighting the importance of regulating token contributions during RL training and establishing TR-GRPO as a robust framework for enhancing LLM reasoning.
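To make the weighting idea concrete, here is a minimal PyTorch sketch of a GRPO-style clipped surrogate loss with a token-level weight that is positively correlated with the model's predicted probability. The function name, tensor shapes, and the specific choice of weight (the detached token probability itself) are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def tr_grpo_token_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Sketch of a token-weighted GRPO surrogate loss (TR-GRPO-style).

    logprobs:     (B, T) log-probs of sampled tokens under the current policy
    old_logprobs: (B, T) log-probs of the same tokens under the old policy
    advantages:   (B,)   group-relative advantages, one per sampled response
    """
    # PPO/GRPO importance ratio per token
    ratio = torch.exp(logprobs - old_logprobs)
    adv = advantages.unsqueeze(-1)  # broadcast the sequence advantage to tokens

    # Standard clipped surrogate objective
    surrogate = torch.minimum(
        ratio * adv,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv,
    )

    # Token-level weight positively correlated with the model's probability:
    # low-probability tokens are downweighted, curbing their outsized gradients.
    # Using the detached probability is one simple choice of weighting function.
    weights = logprobs.detach().exp()

    return -(weights * surrogate).mean()
```

Because the weight is detached, it rescales each token's contribution to the gradient without adding a new gradient path, which is how the downweighting of low-probability tokens tempers their otherwise large gradient magnitudes.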
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 24231