Understanding Reasoning Collapse in Multi-Turn Agent Reinforcement Learning
Keywords: multi-turn reinforcement learning, LLM agents, PPO, GRPO, reasoning collapse, mutual information, conditional entropy, information-theoretic diagnostics, reward-variance filtering
TL;DR: We model reasoning collapse in LLM agent RL with a two-axis view of diversity, propose scalable estimators, and use reward-variance prompt-group SNR filtering to prevent prompt-agnostic template drift.
Abstract: In closed-loop multi-turn agent reinforcement learning, LLM agents exhibit reasoning collapse, where reasoning shifts toward generic templates that are weakly coupled to the inputs. We first show that such collapse is easy to miss with entropy or surface-diversity metrics, since the reasoning text still varies but becomes input-agnostic. We then propose an information-theoretic decomposition of the variation of a reasoning variable $Z$ into conditional entropy $H(Z\mid X)$ (randomness given the same input) and mutual information (MI) $I(X;Z)$ (input dependence). Template collapse occurs when $H(Z\mid X)$ stays high while $I(X;Z)$ drops, yielding diverse-looking but generic reasoning. To make $I(X;Z)$ a reproducible, sanity-checkable diagnostic, we introduce an MI-style retrieval protocol that treats each reasoning trace $Z$ as a query to retrieve its source $X$ from a minibatch; retrieval accuracy degrades toward chance under collapse. We further provide a signal-to-noise-ratio explanation for why $I(X;Z)$ drops: when within-input reward variance $\mathrm{Var}(R\mid X)$ is low, task gradients weaken and input-agnostic regularizers (KL, entropy) dominate, flattening cross-input differences. Finally, we propose reward-variance-aware filtering to prioritize high-signal updates. Across multi-turn environments, model scales, and modalities (including VLMs), this filtering improves input dependence, stability, and performance while remaining competitive with state-of-the-art stabilization baselines.
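The retrieval diagnostic described in the abstract can be sketched in a few lines. This is a toy illustration, not the paper's implementation: it stands in a real text encoder with a hashed bag-of-words embedding, and the `embed`, `retrieval_accuracy` names and the example inputs/traces are hypothetical. The point it demonstrates is the protocol itself: each reasoning trace is used as a query against the minibatch of inputs, and accuracy falls toward chance (1/batch size) when traces become input-agnostic templates.

```python
import hashlib
import numpy as np

def embed(texts, dim=4096):
    """Hashed bag-of-words embeddings (toy stand-in for a real encoder)."""
    out = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            j = int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim
            out[i, j] += 1.0
    # L2-normalize so dot products act as cosine similarities.
    return out / (np.linalg.norm(out, axis=1, keepdims=True) + 1e-8)

def retrieval_accuracy(inputs, traces):
    """Fraction of traces whose nearest input is their own source X.
    Values near 1/len(inputs) (chance) signal input-agnostic reasoning."""
    sims = embed(traces) @ embed(inputs).T   # (n_traces, n_inputs)
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(inputs))))

# Grounded traces share content with their inputs; collapsed ones do not.
inputs = ["add 3 and 4", "sort the list 9 1 5", "reverse the string cab"]
grounded = ["compute 3 plus 4 then answer",
            "compare 9 1 5 and sort ascending",
            "walk cab backwards"]
collapsed = ["let me think step by step"] * 3

print(retrieval_accuracy(inputs, grounded))   # high: traces identify inputs
print(retrieval_accuracy(inputs, collapsed))  # near chance: template drift
```

In practice one would replace the bag-of-words embedding with the agent's own representations or a pretrained encoder; the minibatch-retrieval structure, which is what makes the diagnostic reproducible, is unchanged.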
Submission Number: 74