SLCA-GRPO: Resolving Cross-Segment Credit Misattribution in Tool-Calling RL
Abstract: Tool-calling agents produce heterogeneous outputs, interleaving structured tool invocations with user-facing natural-language summaries. This output heterogeneity exposes a structural failure mode in standard on-policy Reinforcement Learning (RL): algorithms such as GRPO indiscriminately broadcast a single trajectory-level scalar advantage to every token. Consequently, gradient noise from summary generation leaks into tool-decision tokens, causing cross-segment credit misattribution and brittle optimization. In this work, we propose \textbf{SLCA-GRPO}, a framework incorporating \textbf{Segment-Locked Credit Assignment (SLCA)}. To enable scalable exploration and stable training without costly real APIs, we first construct the \textbf{Schema-Guided LLM Simulator (SGLS)} as foundational training infrastructure. Building on this, SLCA introduces the \textbf{first} advantage estimation mechanism that explicitly decouples optimization objectives in tool-calling agents. Supported by \textbf{Hierarchical Rewards (HierR)}, SLCA routes execution advantages solely to tool tokens and preference advantages solely to summary tokens, mathematically eliminating the gradient pathway for cross-segment credit misattribution. On a 7B backbone, SLCA-GRPO accelerates convergence and outperforms standard GRPO by +2.53 pp on in-domain evaluation, +1.62 pp on the Berkeley Function-Calling Leaderboard (BFCL), and +6.33 pp on $\tau^2$-Bench under the same training budget.
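The routing step at the heart of SLCA can be pictured as a per-token mask over two advantage streams. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the names `slca_advantages`, `exec_adv`, `pref_adv`, and `tool_mask` are hypothetical, and the two advantages are assumed to be already group-normalized trajectory-level scalars, as in GRPO.

```python
import torch

def slca_advantages(exec_adv: float, pref_adv: float,
                    tool_mask: torch.Tensor) -> torch.Tensor:
    """Segment-Locked Credit Assignment (sketch).

    exec_adv:  trajectory-level execution advantage (tool-outcome reward stream).
    pref_adv:  trajectory-level preference advantage (summary reward stream).
    tool_mask: bool tensor of shape [T]; True at tool-call tokens,
               False at summary tokens.
    Returns a per-token advantage tensor of shape [T].
    """
    m = tool_mask.float()
    # Each segment receives only the advantage derived from its own reward,
    # so neither gradient signal leaks across the segment boundary.
    return exec_adv * m + pref_adv * (1.0 - m)
```

By contrast, a standard GRPO update would assign one mixed scalar to every token; under the masking above, the summary-reward term contributes zero gradient to tool-decision tokens by construction.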