Keywords: Reinforcement Learning, Large Language Models, RL with Verifiable Rewards, Gradient Magnitude-based Token Selection, Mathematical Reasoning
Abstract: Reinforcement learning (RL) has recently emerged as a central paradigm for enhancing the reasoning abilities of large language models (LLMs). State-of-the-art RL with Verifiable Rewards (RLVR) methods have demonstrated remarkable effectiveness on mathematical reasoning tasks. Recent studies suggest that high-entropy tokens play an exceptionally important role in training, since training on only the top 20\% highest-entropy tokens yields significant performance gains. In this work, we find that while high-entropy tokens within a single answer tend to correlate with large gradient magnitudes, entropy alone fails to consistently reflect token importance across different answers, owing to variation in answer-level reward signals. Based on this observation, we introduce the **G**radient **M**agnitude-based **T**oken **S**election (GMTS) method to quantify token importance. We find that training on the top 20\% of tokens ranked by GMTS achieves substantially better performance than entropy-based selection on well-known math benchmarks (**+1.55** on Qwen2.5-math-1.5B, **+1.33** on Qwen2.5-math-7B, and **+1.85** on Qwen3-8B). These findings indicate that GMTS provides a finer-grained quantification of token importance than entropy, thereby improving RLVR training.
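The abstract does not spell out how GMTS scores tokens, so the following is only a minimal sketch of one plausible instantiation, not the authors' method: it scores each sampled token by the closed-form gradient norm of the per-token policy-gradient loss with respect to its logits, |A_t| · ||softmax(z_t) − onehot(y_t)||, and keeps the top 20\% of tokens batch-wide. The function name `gmts_mask`, the analytic gradient-norm formula, and the batch-wide thresholding are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def gmts_mask(logits, token_ids, advantages, keep_frac=0.2):
    """Hypothetical sketch of gradient-magnitude-based token selection.

    logits:     (batch, seq_len, vocab) policy logits for sampled answers
    token_ids:  (batch, seq_len)        sampled token ids
    advantages: (batch, seq_len)        answer-level advantage, broadcast per token

    Returns a 0/1 mask marking the top `keep_frac` fraction of tokens by an
    analytic proxy for per-token gradient magnitude (an assumption here).
    """
    with torch.no_grad():
        probs = logits.softmax(dim=-1)                                 # (B, T, V)
        onehot = F.one_hot(token_ids, logits.size(-1)).to(probs.dtype)
        # Gradient of -A_t * log pi(y_t | s_t) w.r.t. the logits z_t is
        # A_t * (softmax(z_t) - onehot(y_t)); take its L2 norm as the score.
        grad_norm = advantages.abs() * (probs - onehot).norm(dim=-1)   # (B, T)
        # Keep the top keep_frac tokens across the whole batch.
        k = max(1, int(keep_frac * grad_norm.numel()))
        threshold = grad_norm.flatten().topk(k).values.min()
        mask = (grad_norm >= threshold).to(logits.dtype)
    return mask

# Usage (sketch): per_token_loss = -advantages * logp_taken
#                 loss = (gmts_mask(logits, ids, advantages) * per_token_loss).sum() / mask.sum()
```

Under these assumptions the selection is consistent with the abstract's observation: within one answer the score rises with entropy (a peaked softmax yields a small gradient norm), while across answers it is additionally scaled by |A_t|, so tokens from answers with weak reward signal are down-weighted.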
Primary Area: reinforcement learning
Submission Number: 25206