Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang; Le Yu; Chang Gao; Chujie Zheng; Shixuan Liu; Rui Lu; Kai Dang; Xiong-Hui Chen; Jianxin Yang; Zhenru Zhang; Yuqiong Liu; An Yang; Andrew Zhao; Yang Yue; Shiji Song; Bowen Yu; Gao Huang; Junyang Lin

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xiong-Hui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin

Published: 18 Sept 2025, Last Modified: 29 Oct 2025NeurIPS 2025 posterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Large Language Model, Reasoning, Reinforcement Learning, Entropy

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), yet its underlying mechanisms remain insufficiently understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction (approximately 20\%) of tokens exhibit high entropy, and these tokens semantically act as critical forks that steer the model toward diverse reasoning pathways. We further demonstrate that moderately increasing the entropy of these high-entropy tokens via decoding temperature adjustments leads to improved performance, quantitatively confirming their role as decision points in reasoning. We ultimately refine RLVR by restricting policy gradient updates to these forking tokens. Despite utilizing only 20\% of tokens, our approach achieves comparable performance to full-gradient updates on the Qwen3-8B base model. Moreover, it demonstrates remarkable improvements on the larger Qwen3-32B base model, boosting AIME'25 scores by 11.04 and AIME'24 scores by 7.71. In contrast, training exclusively on the 80\% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that dictate key reasoning directions. Collectively, our results suggest promising avenues for optimizing RLVR algorithms by strategically leveraging the potential of these high-entropy minority tokens to further enhance the reasoning abilities of LLMs.

Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)

Submission Number: 7106

Loading