$\mathbf{T^3}$: Reducing Belief Deviation in Reinforcement Learning for Active Reasoning

Published: 06 Oct 2025, Last Modified: 04 Nov 2025 · MTI-LLM @ NeurIPS 2025 Poster · CC BY-ND 4.0
Keywords: large language models, reinforcement learning, active reasoning
Abstract: Active reasoning requires large language models (LLMs) to interact with external sources and gather missing information to solve a problem. Reinforcement learning with outcome rewards, the \textit{de facto} approach for incentivizing active reasoning in LLMs, however, often loses track of problem states and generates uninformative, repetitive actions. This compounds belief deviation -- the divergence between the oracle belief and the agent’s internal belief state. To mitigate this issue, it is essential to assign rewards to, and thereby promote, intermediate steps that are purposeful and informative for solving the problem, while avoiding being trapped by cumulative belief deviation. Since directly tracking the deviation of belief states is intractable, we introduce $\mathbf{T^3}$, which leverages proxy signals of excessive belief deviation to assign intermediate rewards or to truncate rollout trajectories during training. Across two recent datasets tailored for active reasoning, $\mathbf{T^3}$ improves both the performance and stability of diverse RL algorithms, achieving gains of up to 30\%. These results highlight belief control as a key principle for training robust LLM-based active reasoners.
Submission Number: 115
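
The abstract describes two control mechanisms driven by proxy signals of excessive belief deviation: assigning intermediate rewards and truncating rollout trajectories during training. The Python sketch below is a minimal, hypothetical illustration of that control loop under stated assumptions; the specific proxy (repeated or uninformative actions), the class and field names, and the penalty/threshold values are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: use a proxy for excessive belief deviation (here,
# repeated or uninformative actions -- an assumption, not the paper's exact
# proxy) to either assign a small intermediate penalty or truncate the
# rollout early during RL training.

from dataclasses import dataclass, field

@dataclass
class Step:
    action: str          # the query/action the agent emitted at this turn
    info_gain: float     # proxy for how informative the environment reply was

@dataclass
class RolloutController:
    penalty: float = -0.1        # intermediate reward for a low-information step
    max_bad_steps: int = 3       # truncate after this many consecutive bad steps
    seen_actions: set = field(default_factory=set)
    bad_streak: int = 0

    def step_reward(self, step: Step) -> tuple:
        """Return (intermediate_reward, truncate) for one rollout step."""
        is_repetitive = step.action in self.seen_actions
        is_uninformative = step.info_gain <= 0.0
        self.seen_actions.add(step.action)

        if is_repetitive or is_uninformative:
            self.bad_streak += 1
            truncate = self.bad_streak >= self.max_bad_steps
            return self.penalty, truncate
        self.bad_streak = 0
        return 0.0, False

# Example: after three consecutive repetitive/uninformative steps,
# the rollout is truncated instead of accumulating further deviation.
ctrl = RolloutController()
for s in [Step("ask A", 0.5), Step("ask A", 0.0), Step("ask A", 0.0), Step("ask A", 0.0)]:
    r, stop = ctrl.step_reward(s)
    print(r, stop)
    if stop:
        break
```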