Both Local Validity and Global Effectiveness Matter: Decoupled Credit Assignment for Long‑Horizon Agentic Learning

ICLR 2026 Conference Submission 17287 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Agentic RL; Long‑Horizon Agent
TL;DR: We fix RL for LLM agents by multiplying an "is the action valid?" signal with the "was the action good?" signal, instead of adding them, which stops the agent from being rewarded for writing bad code.
Abstract: The natural-language action space of Large Language Model (LLM) agents creates a real risk of invalid outputs (e.g., API rejections, parsing errors). Consequently, in Reinforcement Learning (RL) for long-horizon LLM agents, learning to generate a locally valid action in each turn is as crucial as selecting a globally effective one. However, this requirement is overlooked by the prevailing additive paradigm for credit assignment in agentic RL. Specifically, it computes an action's credit by summing an estimated local score with the trajectory-level score. This paradigm assigns a ``contribution'' score to all actions regardless of their validity, allowing invalid actions to be assigned positive credit, especially in positive trajectories. To address this, we propose Multiplicative Gated Rewards (MGR), which decouples local action-level validity from global effectiveness. MGR uses a fact-based validity signal, derived from direct environment feedback and syntactic validity, to determine the action-level score (e.g., $\pm$1). This score is then multiplied by the magnitude of the trajectory-level score. This ensures that the action's validity strictly governs the reward's polarity, preventing credit misassignment. Experiments demonstrate that our method improves training stability and achieves SOTA performance on long-horizon LLM agent benchmarks. The code for MGR is included in the Supplementary Material.
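The following is a minimal sketch of the gating idea described in the abstract, contrasted with the additive paradigm; it is not the authors' implementation, and the function names, the $\pm$1 validity encoding, and the example scores are illustrative assumptions.

```python
# Illustrative sketch of additive vs. multiplicative-gated credit assignment.
# Names and values are assumptions for exposition only.

def additive_credit(local_score: float, traj_score: float) -> float:
    """Prevailing additive paradigm: per-action credit is the sum of an
    estimated local score and the trajectory-level score. An invalid action
    can still receive positive credit when traj_score is large."""
    return local_score + traj_score


def multiplicative_gated_credit(is_valid: bool, traj_score: float) -> float:
    """Multiplicative gating: a fact-based validity signal (+1 / -1) sets the
    polarity of the credit, while the trajectory-level score contributes only
    its magnitude, so an invalid action is always penalized."""
    validity = 1.0 if is_valid else -1.0
    return validity * abs(traj_score)


if __name__ == "__main__":
    # A successful trajectory (traj_score = +1.0) containing one invalid
    # action (e.g., an API call rejected by the environment).
    traj_score = 1.0
    print(additive_credit(local_score=0.2, traj_score=traj_score))             # 1.2  -> invalid action rewarded
    print(multiplicative_gated_credit(is_valid=False, traj_score=traj_score))  # -1.0 -> invalid action penalized
```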
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 17287