Keywords: large language model, token entropy
TL;DR: We propose a quantitative analysis framework for entropy change and analyze entropy interventions in LLMs
Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) can enhance LLM reasoning, its training process poses a critical risk: Entropy Collapse.
This phenomenon, a rapid loss of policy diversity, stems from an exploration-exploitation imbalance and leads to suboptimal solutions.
Recent entropy-intervention methods aim to prevent this, yet their underlying mechanisms remain unclear.
In this paper, we conduct extensive experiments to characterize token-level entropy changes and to examine how existing entropy-intervention methods help avoid entropy collapse.
Our findings reveal a fundamental limitation of existing methods: they control entropy only indirectly, by adjusting related factors such as the advantage signal and generation probability, so their effectiveness is inherently limited and prone to failure.
To address this limitation, we introduce an entropy-change-aware reweighting scheme, **S**tabilizing **T**oken-level **E**ntropy-chang**E** via **R**eweighting (**STEER**), which adaptively stabilizes entropy dynamics through fine-grained, token-level adjustments. This approach prevents over-exploitation while ensuring robust exploration.
Our extensive experiments demonstrate that **STEER** effectively avoids entropy collapse, stabilizes entropy dynamics, and achieves stronger downstream performance across mathematical reasoning benchmarks.
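To make the idea of entropy-change-aware, token-level reweighting concrete, here is a minimal sketch in PyTorch. The abstract does not specify STEER's actual reweighting rule, so the `entropy_aware_weights` function, the `alpha` hyperparameter, and the sigmoid form below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the exact STEER reweighting rule is not given in the
# abstract; this shows one plausible way to scale a token-level policy-gradient
# loss by the sign and magnitude of per-token entropy change.
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token policy entropy, shape (batch, seq_len)."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)

def entropy_aware_weights(entropy_new: torch.Tensor,
                          entropy_old: torch.Tensor,
                          alpha: float = 1.0) -> torch.Tensor:
    """Hypothetical rule: down-weight tokens whose entropy is dropping fast
    (over-exploitation), up-weight tokens whose entropy is rising (exploration)."""
    delta = entropy_new - entropy_old          # positive = entropy increasing
    return 2.0 * torch.sigmoid(alpha * delta)  # weights in (0, 2), equal to 1 when delta == 0

def reweighted_pg_loss(log_probs_taken: torch.Tensor,
                       advantages: torch.Tensor,
                       weights: torch.Tensor) -> torch.Tensor:
    """Token-level policy-gradient loss scaled by the entropy-change weights."""
    return -(weights.detach() * advantages * log_probs_taken).mean()
```

Under these assumptions, the weights act purely as a per-token modulation of the existing RLVR objective: tokens driving rapid entropy loss contribute less to the update, while tokens that preserve or increase entropy contribute more.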
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 302