Who Gets the Reward & Who Gets the Blame? Evaluation-Aligned Post-Training for Multi-LLM Agents

Published: 23 Sept 2025, Last Modified: 22 Nov 2025 · LAW · CC BY-NC 4.0
Keywords: multi-LLM agents, evaluation-aligned post-training, game-theoretic credit assignment, multi-agent reinforcement learning
TL;DR: We propose an evaluation-aligned framework that maps system outcomes to agent and response signals, enabling localized post-training for multi-LLM systems with interpretable artifacts and standard objectives.
Abstract: Teams of LLM agents are increasingly used for complex tasks, yet post-training of these multi-agent systems (MAS) is often outcome-only and ad hoc: effective learning requires localized signals that specify which agent and which action to adjust. We introduce an evaluation-aligned framework that maps “system evaluation → agent credit → response-level signals,” closing the loop between evaluation and training. In successful episodes, the system reward is attributed fairly to agents and then distributed across their steps using context-aware labels; in failures, we localize the first error in the interaction trace to construct preference comparisons. These signals drop into standard post-training (value-free policy gradients with KL on success; preference learning on failure) without bespoke algorithms. Progress is tracked by the same evaluator that defines the signals (e.g., success rates and scores), with optional surrogate judges to reduce cost. The result is a simple, task- and model-agnostic recipe that yields interpretable artifacts (agent contributions, step helpfulness) and turns routine operation into training data. As a conceptual paper, we formalize this mapping and protocol.
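The abstract's "system evaluation → agent credit → response-level signals" mapping can be illustrated with a minimal sketch. Since the keywords mention game-theoretic credit assignment, the sketch below uses exact Shapley values to split a system reward across agents, then distributes each agent's credit over its steps in proportion to context-aware helpfulness labels. All names (`shapley_credits`, `step_signals`, the toy coalition scores and labels) are hypothetical illustrations, not the paper's actual implementation.

```python
from itertools import permutations
from math import factorial


def shapley_credits(agents, value):
    """Attribute a system-level reward fairly to agents via exact Shapley values.

    `value` maps a frozenset of agents (a coalition) to the evaluator's score
    for an episode run with only those agents. Exact enumeration is O(n!),
    so this toy version is only practical for a handful of agents.
    """
    credits = {a: 0.0 for a in agents}
    for order in permutations(agents):
        coalition = set()
        for a in order:
            before = value(frozenset(coalition))
            coalition.add(a)
            # Marginal contribution of agent `a` under this ordering.
            credits[a] += value(frozenset(coalition)) - before
    n_orderings = factorial(len(agents))
    return {a: c / n_orderings for a, c in credits.items()}


def step_signals(agent_credit, helpfulness_labels):
    """Distribute one agent's credit over its steps, weighted by
    context-aware helpfulness labels (hypothetical scheme)."""
    total = sum(helpfulness_labels)
    return [agent_credit * h / total for h in helpfulness_labels]


# Toy episode: two agents, evaluator scores for each coalition (assumed values).
scores = {
    frozenset(): 0.0,
    frozenset({"planner"}): 0.2,
    frozenset({"coder"}): 0.3,
    frozenset({"planner", "coder"}): 1.0,  # full system reward
}
credits = shapley_credits(["planner", "coder"], scores.__getitem__)
signals = step_signals(credits["planner"], [1.0, 3.0])  # planner took two steps
```

With these toy numbers the planner receives 0.45 and the coder 0.55 of the system reward, and the planner's credit is then split 0.1125 / 0.3375 across its two steps; the resulting step-level signals would feed a standard KL-regularized policy-gradient objective, per the recipe sketched in the abstract.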
Submission Type: Position/Review Paper (4-9 Pages)
Submission Number: 99