Token Hidden Reward: Steering Exploration-Exploitation in GRPO Training

Published: 09 Jul 2025, Last Modified: 16 Jul 2025 · AI4Math@ICML25 Oral · CC BY-NC-SA 4.0
Keywords: Large language model, Reasoning, Reinforcement Learning
Abstract: Reinforcement learning (RL) has substantially advanced the reasoning capabilities of large language models (LLMs), yet how to explicitly guide training toward exploration or exploitation remains underexplored. In this work, we start from the assumption that response confidence—the model’s likelihood assigned to correct responses—is a meaningful objective for reasoning tasks. To better understand and control learning under this objective, we analyze token-level dynamics in GRPO training and introduce Token Hidden Reward (THR), a novel metric that quantifies the contribution of individual tokens to response confidence. Based on THR, we propose a THR-guided reweighting strategy that modulates the learning signal to explicitly favor either high-confidence outputs (i.e., exploitation) or broader output diversity (i.e., exploration). Empirically, we find that increasing confidence mostly aligns with improved greedy decoding performance (exploitation), while encouraging lower-confidence outputs consistently boosts Pass$@K$ performance (exploration).
Submission Number: 86
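
The abstract describes a THR-guided reweighting of the GRPO learning signal. The sketch below is a minimal, illustrative Python/PyTorch example of one way such a token-level reweighting could be wired into a GRPO-style policy-gradient loss; the function name `grpo_thr_loss`, the softmax-based weighting, and the `thr` tensor passed in are assumptions for illustration, not the paper's actual definition or implementation of Token Hidden Reward.

```python
# Hypothetical sketch: token-level reweighting of a GRPO-style loss using
# assumed per-token THR scores. alpha > 0 shifts the learning signal toward
# high-THR tokens (exploitation); alpha < 0 toward low-THR tokens (exploration).
import torch

def grpo_thr_loss(token_logps, advantages, thr, mask, alpha=1.0):
    """token_logps: (B, T) log-probs of sampled tokens under the current policy
    advantages:  (B,)   group-normalized (GRPO) advantages, one per response
    thr:         (B, T) assumed per-token Token Hidden Reward scores
    mask:        (B, T) 1 for response tokens, 0 for padding
    alpha:       scalar steering exploitation (>0) vs. exploration (<0)."""
    # Per-token weights: softmax over THR within each response, rescaled so the
    # weights average to 1 over valid (unmasked) tokens.
    weights = torch.softmax(alpha * thr.masked_fill(mask == 0, -1e9), dim=-1)
    weights = weights * mask.sum(-1, keepdim=True)
    # Standard policy-gradient term, modulated by the THR-derived weights.
    per_token = -(weights * advantages.unsqueeze(-1) * token_logps) * mask
    return per_token.sum() / mask.sum().clamp(min=1)

# Toy usage with random tensors (shapes only; not real model outputs).
B, T = 4, 16
logps = torch.randn(B, T).clamp(max=0)   # fake token log-probs
adv = torch.randn(B)                      # fake group-normalized advantages
thr = torch.randn(B, T)                   # fake THR scores
mask = torch.ones(B, T)
loss_exploit = grpo_thr_loss(logps, adv, thr, mask, alpha=1.0)
loss_explore = grpo_thr_loss(logps, adv, thr, mask, alpha=-1.0)
```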