Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective

ICLR 2026 Conference Submission 302 Authors

01 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: large language model, token entropy
TL;DR: We propose a quantitative analysis framework for entropy change and analyze entropy interventions in LLMs
Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) can enhance LLM reasoning, its training process poses a critical risk: entropy collapse. This phenomenon, a rapid loss of policy diversity stemming from an exploration-exploitation imbalance, leads to suboptimal solutions. Recent entropy-intervention methods aim to prevent this, yet their underlying mechanisms remain unclear. In this paper, we conduct extensive experiments to reveal token-level entropy changes and to explain how existing entropy-intervention methods help avoid entropy collapse. Our findings expose a fundamental limitation of existing methods: they attempt to control entropy only indirectly. Because they adjust only related factors, such as the advantage signal and generation probability, their effectiveness is inherently limited and prone to failure. To address this limitation, we introduce an entropy-change-aware reweighting scheme, namely **S**tabilizing **T**oken-level **E**ntropy-chang**E** via **R**eweighting (**STEER**), that adaptively stabilizes entropy dynamics through fine-grained, token-level adjustments. This approach prevents over-exploitation while ensuring robust exploration. Our extensive experiments demonstrate that **STEER** effectively prevents entropy collapse, stabilizes entropy dynamics, and achieves stronger downstream performance across math reasoning benchmarks.
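
The abstract does not specify the exact reweighting rule, so the following is only a minimal sketch of what an entropy-change-aware, token-level reweighting could look like: each token's contribution to the policy-gradient loss is scaled by a factor derived from the change in its policy entropy between updates, damping tokens whose entropy is collapsing and preserving those that remain exploratory. The function name `steer_like_weights` and the hyperparameters `beta`, `target_delta`, `w_min`, and `w_max` are illustrative assumptions, not the paper's actual formulation.

```python
import torch


def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the policy distribution at each token position."""
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)  # shape: [batch, seq]


def steer_like_weights(entropy_old: torch.Tensor,
                       entropy_new: torch.Tensor,
                       target_delta: float = 0.0,
                       beta: float = 1.0,
                       w_min: float = 0.5,
                       w_max: float = 2.0) -> torch.Tensor:
    """Illustrative token-level reweighting based on per-token entropy change.

    Tokens whose entropy drops fastest (a sign of over-exploitation) are
    down-weighted; tokens whose entropy is stable or rising keep or gain weight.
    The exponential form and the clamp range are assumptions for this sketch.
    """
    delta = entropy_new - entropy_old              # per-token entropy change
    weights = torch.exp(beta * (delta - target_delta))
    return weights.clamp(w_min, w_max)


# Usage sketch: scale token-level advantages before the policy-gradient loss,
# e.g. loss = -(weights.detach() * advantages * log_probs_taken).mean()
```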
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 302