SLCA-GRPO: Resolving Cross-Segment Credit Misattribution in Tool-Calling RL
Abstract: Tool-calling agents produce heterogeneous outputs, interleaving structured tool invocations with user-facing natural-language summaries. This output heterogeneity exposes a structural failure mode in standard on-policy Reinforcement Learning (RL): algorithms such as GRPO indiscriminately broadcast a single trajectory-level scalar advantage to every token. Consequently, gradient noise from summary generation leaks into tool-decision tokens, causing cross-segment credit misattribution and brittle optimization. In this work, we propose \textbf{SLCA-GRPO}, a framework incorporating \textbf{Segment-Locked Credit Assignment (SLCA)}. To enable scalable exploration and stable training without costly real APIs, we first construct the \textbf{Schema-Guided LLM Simulator (SGLS)} as foundational training infrastructure. Building on this, SLCA introduces the \textbf{first} advantage estimation mechanism that explicitly decouples optimization objectives in tool-calling agents. Supported by \textbf{Hierarchical Rewards (HierR)}, SLCA routes execution advantages solely to tool tokens and preference advantages solely to summary tokens, mathematically eliminating the gradient pathway for cross-segment credit misattribution. On a 7B backbone, SLCA-GRPO accelerates convergence and outperforms standard GRPO by +2.53 pp on in-domain evaluation, +1.62 pp on the Berkeley Function-Calling Leaderboard (BFCL), and +6.33 pp on $\tau^2$-Bench under the same training budget.
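The routing step at the heart of SLCA can be pictured as a per-token mask over two advantage streams. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the names `slca_advantages`, `exec_adv`, `pref_adv`, and `tool_mask` are hypothetical, and the two advantages are assumed to be already group-normalized trajectory-level scalars, as in GRPO.

```python
import torch

def slca_advantages(exec_adv: float, pref_adv: float,
                    tool_mask: torch.Tensor) -> torch.Tensor:
    """Segment-Locked Credit Assignment (sketch).

    exec_adv:  trajectory-level execution advantage (tool-outcome reward stream).
    pref_adv:  trajectory-level preference advantage (summary reward stream).
    tool_mask: bool tensor of shape [T]; True at tool-call tokens,
               False at summary tokens.
    Returns a per-token advantage tensor of shape [T].
    """
    m = tool_mask.float()
    # Each segment receives only the advantage derived from its own reward,
    # so neither gradient signal leaks across the segment boundary.
    return exec_adv * m + pref_adv * (1.0 - m)
```

By contrast, a standard GRPO update would assign one mixed scalar to every token; under the masking above, the summary-reward term contributes zero gradient to tool-decision tokens by construction.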