On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, ICML 2026 regular, CC BY-SA 4.0
Abstract: Reinforcement learning (RL) has become a de facto paradigm for building LLM-based agents that act, interact, and reason over extended task horizons. However, in active reasoning, where agents must elicit new observations through interaction with the environment to solve the task, we find that outcome-based RL can induce a systematic failure mode that we call information self-locking (SeL): agents fail both to elicit informative feedback and to internalize the evidence they obtain. To understand the issue, we decompose agentic behavior into two coupled capabilities: Action Selection (AS), which determines the observation stream, and Belief Tracking (BT), which updates the agent’s internal task understanding. Theoretical and empirical analyses reveal a bidirectional bottleneck: weak BT obscures the credit of informative actions, while weak AS deprives BT of useful evidence. This coupling weakens the learning signal for both capabilities and leads to SeL. To mitigate this issue, we propose AREW, a simple yet effective Advantage Reweighting method that uses easy-to-obtain directional critiques to reallocate credit within trajectories. Extensive experiments across 9 agentic tasks of varying complexity show that AREW significantly mitigates SeL, yielding gains of up to 60 points in final performance. Code is available at https://github.com/unimpor/T3.
Primary Area: Deep Learning->Large Language Models
Keywords: large language models, agentic reinforcement learning, agentic active reasoning
Submission Number: 29063