Keywords: large language model, token entropy
TL;DR: We propose a quantitative analysis framework for entropy change and analyze entropy interventions in LLMs
Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) can enhance LLM reasoning, its training process poses a critical risk: Entropy Collapse.
This phenomenon, a rapid loss of policy diversity, stems from an exploration-exploitation imbalance and leads to suboptimal solutions.
Recent entropy-intervention methods aim to prevent this, yet their underlying mechanisms remain unclear.
In this paper, we conduct extensive experiments to characterize token-level entropy changes and to examine how existing entropy-intervention methods help avoid entropy collapse.
Our findings reveal a fundamental limitation of existing methods: they control entropy only indirectly, by adjusting related factors such as the advantage signal and generation probability, so their effectiveness is inherently limited and prone to failure.
To address this limitation, we introduce an entropy-change-aware reweighting scheme, **S**tabilizing **T**oken-level **E**ntropy-chang**E** via **R**eweighting (**STEER**), which adaptively stabilizes entropy dynamics through fine-grained, token-level adjustments. This approach prevents over-exploitation while ensuring robust exploration.
Our extensive experiments demonstrate that **STEER** effectively avoids entropy collapse, stabilizes entropy dynamics, and achieves stronger downstream performance across mathematical reasoning benchmarks.
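To make the idea of entropy-change-aware, token-level reweighting concrete, here is a minimal sketch in PyTorch. The abstract does not specify STEER's actual reweighting rule, so the `entropy_aware_weights` function, the `alpha` hyperparameter, and the sigmoid form below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the exact STEER reweighting rule is not given in the
# abstract; this shows one plausible way to scale a token-level policy-gradient
# loss by the sign and magnitude of per-token entropy change.
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token policy entropy, shape (batch, seq_len)."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)

def entropy_aware_weights(entropy_new: torch.Tensor,
                          entropy_old: torch.Tensor,
                          alpha: float = 1.0) -> torch.Tensor:
    """Hypothetical rule: down-weight tokens whose entropy is dropping fast
    (over-exploitation), up-weight tokens whose entropy is rising (exploration)."""
    delta = entropy_new - entropy_old          # positive = entropy increasing
    return 2.0 * torch.sigmoid(alpha * delta)  # weights in (0, 2), equal to 1 when delta == 0

def reweighted_pg_loss(log_probs_taken: torch.Tensor,
                       advantages: torch.Tensor,
                       weights: torch.Tensor) -> torch.Tensor:
    """Token-level policy-gradient loss scaled by the entropy-change weights."""
    return -(weights.detach() * advantages * log_probs_taken).mean()
```

Under these assumptions, the weights act purely as a per-token modulation of the existing RLVR objective: tokens driving rapid entropy loss contribute less to the update, while tokens that preserve or increase entropy contribute more.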
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 302