Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Sequence-Level Likelihood
Keywords: Large Language Model, Reinforcement Learning, Entropy Regularization
Abstract: Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly their mathematical reasoning performance. However, GRPO and related entropy-regularization methods still struggle with token-level sparse rewards, an inherent challenge in chain-of-thought (CoT) reasoning. These approaches often rely on undifferentiated token-level entropy regularization, which easily leads to entropy collapse or model degradation under sparse token rewards. In this work, we propose TEPO, a novel token-level framework that (1) leverages sequence-level likelihood to link group-level rewards with individual tokens via token-level aggregation, and (2) introduces a token-level KL-divergence mask constraint that targets tokens with positive advantages and decreasing entropy to mitigate abrupt policy updates. Experiments demonstrate that TEPO not only achieves state-of-the-art performance on mathematical reasoning benchmarks but also markedly enhances training stability, reducing convergence time by 50% compared with GRPO/DAPO.
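For concreteness, the sketch below gives one possible reading of the two mechanisms described in the abstract, written in PyTorch. It is an illustrative assumption rather than the paper's implementation: the function name tepo_sketch_loss, the per-token entropy inputs, and the clip_eps and kl_coeff constants are all placeholders.

    import torch

    def tepo_sketch_loss(new_logps,        # (T,) per-token log-probs under the current policy
                         old_logps,        # (T,) per-token log-probs under the sampling policy
                         ref_logps,        # (T,) per-token log-probs under the reference policy
                         group_advantage,  # scalar group-relative advantage for this sequence
                         token_entropy,    # (T,) current per-token policy entropy
                         prev_entropy,     # (T,) per-token entropy at the previous update
                         clip_eps=0.2, kl_coeff=0.05):
        # (1) Sequence-level likelihood as the bridge: the per-token likelihood ratios
        #     jointly form the sequence likelihood, so the single group-level advantage
        #     is aggregated into per-token gradient contributions (one reading of the
        #     abstract's "token-level aggregation").
        token_ratio = torch.exp(new_logps - old_logps)                      # (T,)
        clipped = torch.clamp(token_ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        policy_term = torch.min(token_ratio * group_advantage,
                                clipped * group_advantage)                  # (T,)

        # (2) Token-level KL mask: penalize divergence from the reference policy only
        #     on tokens with positive advantage and decreasing entropy, damping abrupt
        #     updates on the tokens most prone to entropy collapse.
        per_token_kl = new_logps - ref_logps          # simple sampled estimate of the KL
        mask = (token_entropy < prev_entropy).float() * float(group_advantage > 0.0)
        kl_term = kl_coeff * mask * per_token_kl

        # Maximize the surrogate, i.e. minimize its negation, averaged over tokens.
        return -(policy_term - kl_term).mean()

In an actual training loop these inputs would come from the model's log-probabilities and entropies over sampled rollouts; the sketch only fixes the shape of the per-token objective.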
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Large Language Model, Reinforcement Learning, Entropy Regularization
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 9881