Keywords: Multiagent Systems, Agentic Systems, Post-training, Reinforcement Learning, Large Language Models, LLM Agents, Process Reward, LLM-as-a-Judge, Distributed Training
TL;DR: We propose agent-as-a-coach, a stateful LLM agent that leverages tool-augmented reasoning, persistent memory, and interaction with the environment to provide dense, per-action reward signals for multiagent LLM finetuning.
Abstract: Training multiagent LLM systems with reinforcement learning requires solving credit assignment: determining which agent’s actions led to success or failure in long, multi-turn trajectories. Process rewards from external LLM judges address this by scoring each agent’s actions rather than only the final outcome. However, existing judges are stateless: each evaluation is an independent LLM call with no memory of prior scores, no awareness of scoring drift, and no mechanism for self-calibration, even as thousands of evaluations accumulate over a training run. We propose upgrading the stateless LLM-as-a-Judge to agent-as-a-coach: a tool-augmented LLM agent with persistent per-coach memory, rolling evaluation statistics, and a multi-turn reasoning loop. The coach queries its own scoring history, detects task-type biases, writes qualitative notes for self-calibration, and passes messages via external memory. In preliminary experiments on DSBench with a three-agent data science pipeline (Qwen3-4B), our agent-as-a-coach approach produces adaptive rebalancing between classification and regression tasks: anti-correlated performance oscillations that a stateless judge, lacking cross-evaluation awareness, cannot generate.
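The stateful part of the coach described above (persistent memory, rolling evaluation statistics, task-type bias detection, and self-calibration notes) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `CoachMemory` and `calibrate`, the window size, the `strength` parameter, and the 0.1 note threshold are all hypothetical, and the LLM call that would produce `raw_score` is left out.

```python
from collections import defaultdict, deque
from statistics import mean


class CoachMemory:
    """Hypothetical per-coach store: rolling scores per task type plus notes."""

    def __init__(self, window: int = 50):
        # One bounded window of recent scores per task type (assumed design).
        self.scores = defaultdict(lambda: deque(maxlen=window))
        self.notes = []

    def record(self, task_type: str, score: float) -> None:
        self.scores[task_type].append(score)

    def task_type_bias(self) -> dict:
        """Gap between each task type's rolling mean and the mean across types."""
        means = {t: mean(w) for t, w in self.scores.items() if w}
        if not means:
            return {}
        overall = mean(list(means.values()))
        return {t: m - overall for t, m in means.items()}


def calibrate(raw_score: float, task_type: str, memory: CoachMemory,
              strength: float = 0.5) -> float:
    """Shift a raw score against the coach's own detected task-type bias."""
    bias = memory.task_type_bias().get(task_type, 0.0)
    # Clamp to [0, 1], assuming rewards are normalized to that range.
    adjusted = min(1.0, max(0.0, raw_score - strength * bias))
    memory.record(task_type, raw_score)
    if abs(bias) > 0.1:  # illustrative threshold for writing a note
        memory.notes.append(f"{task_type}: rolling bias {bias:+.2f}, score adjusted")
    return adjusted
```

A stateless judge corresponds to calling the scorer with no `memory` argument at all; the sketch only shows how a rolling history makes a cross-evaluation correction possible in principle.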
Submission Number: 96