RAGEN-2: Reasoning Collapse in Agentic RL

Published: 03 Jun 2026, Last Modified: 03 Jun 2026 | AI4GOOD Workshop 2026 Regular | Everyone | CC BY 4.0
Keywords: multi-turn reinforcement learning, LLM agents, PPO, GRPO, reasoning collapse, mutual information, conditional entropy, information-theoretic diagnostics, reward-variance filtering
TL;DR: We model reasoning collapse in LLM agent RL with a two-axis view of diversity, propose scalable estimators, and use reward-variance (SNR) filtering of prompt groups to prevent prompt-agnostic template drift.
Abstract: RL training of multi-turn LLM agents is unstable, and reasoning quality drives task performance. Entropy, the standard reasoning-stability monitor, only measures within-input diversity and misses whether reasoning depends on the input. We identify template collapse: stable entropy alongside input-agnostic boilerplate, invisible to entropy and existing metrics. We diagnose it via a mutual-information (MI) proxy that scores cross-input distinguishability online; across tasks, MI correlates with final performance far more strongly than entropy. We then explain collapse via a signal-to-noise ratio (SNR) mechanism: low within-input reward variance weakens task gradients, letting input-agnostic regularization dominate and erase cross-input differences. We mitigate this with SNR-Aware Filtering, prioritizing high-variance prompts each iteration. Across planning, math reasoning, web navigation, and code execution, the method consistently improves input dependence and task performance.
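Illustrative sketch (not taken from the submission): one simple way to prioritize high-variance prompts each iteration is to rank prompt groups by the variance of their rollout rewards and keep the top fraction. The function name `filter_by_reward_variance`, the `keep_fraction` parameter, and the top-fraction selection rule are all assumptions made for illustration; the abstract does not specify the exact rule the method uses.

```python
import math
from statistics import pvariance


def filter_by_reward_variance(groups, keep_fraction=0.5):
    """Return prompt ids ranked by within-prompt reward variance, keeping the top fraction.

    groups: dict mapping a prompt id to the list of scalar rewards obtained
            from several rollouts of that same prompt (hypothetical layout).
    keep_fraction: share of prompts to keep (an assumed hyperparameter).
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not groups:
        return []

    # Population variance of rewards within each prompt group.
    # A group whose rollouts all receive the same reward has variance 0,
    # so it carries no within-input learning signal and is ranked last.
    ranked = sorted(groups.items(), key=lambda kv: pvariance(kv[1]), reverse=True)

    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return [prompt_id for prompt_id, _ in ranked[:n_keep]]


if __name__ == "__main__":
    rewards = {
        "a": [0, 0, 0, 0],  # always fails: variance 0
        "b": [0, 1, 0, 1],  # mixed outcomes: highest variance
        "c": [1, 1, 1, 1],  # always succeeds: variance 0
        "d": [0, 0, 0, 1],  # occasional success
    }
    print(filter_by_reward_variance(rewards, keep_fraction=0.5))
```

In this toy input, prompts whose rollouts all succeed or all fail are dropped first, which matches the abstract's point that low within-input reward variance yields weak task gradients.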
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 530