Information-Tight Value-Loss Guarantees for Test-Time Committees in Cooperative MARL

TMLR Paper9728 Authors

13 Jun 2026 (modified: 19 Jun 2026), Under review for TMLR, CC BY 4.0
Abstract: Cooperative multi-agent reinforcement learning (MARL) deployments increasingly spend test-time compute through committees of policy checkpoints, seeds, or ensemble advisors that vote on each agent's action. We study how to certify the team value-loss of such a frozen agreement-gated committee controller relative to a fixed reference policy $\pi^{\mathrm{ref}}$, using only deployment-time observable information. This is a certification problem for a frozen controller, not policy learning. We first show that per-agent marginal certification is invalid: its under-estimation compounds linearly with team size and disappears at $n = 1$, so the obstruction is genuinely multi-agent. A sequential counterexample then shows that a reference-prefix telescoping bound can strictly under-estimate the true loss; validity requires a joint occupancy-weighted certificate. Our main result is a range-aware information characterization. The finite-horizon return range supplies an information-independent ceiling $R_{\max} = H \Delta_r$, while deployment observables induce a chain of information terms $C_0 \ge C_1 \ge C_2$ over three nested information sets $I_0 \preceq I_1 \preceq I_2$. The unconditional guarantee is the chain $L \le R_{\max} \wedge C_2$, $R_{\max} \wedge C_2 \le R_{\max} \wedge C_1$, and $R_{\max} \wedge C_1 \le R_{\max} \wedge C_0$, where $L = J(\pi^{\mathrm{ref}}) - J(\pi^{\mathrm{ctrl}}_N)$ is the team value-loss. In the clean endorsement regime ($\eta = 0$), we establish profile-relative optimality over an explicit constructive witness class, together with pointwise sharpness of the pre-cap coordinate-local terms over all admissible unit laws. The carrier uses only the failure probability $g$, so it is agnostic to the committee's internal dependence structure and covers arbitrarily correlated advisors; logging the executed fallback action identity is what moves the worst-action certificate $C_1$ to the tighter logged-fallback certificate $C_2$. 
We then turn $C_2$ into a fresh-rollout, distribution-free $1 - \delta$ certificate with an explicit conservative value-bound construction, and a matching rare-unit lower bound. Exact cooperative Markov games verify validity and tightness against dynamic-programming truth, a conservative rollout-bridge experiment demonstrates valid certification under conservative rollout value bounds, and a tabular over-dispersion experiment confirms that a binomial plug-in under-covers on correlated committees while the dependence-agnostic certificate stays valid.
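The capped chain stated in the abstract can be read as a consequence of the monotonicity of the minimum. The following is a short sketch, assuming only the quantities named above ($R_{\max} = H \Delta_r$, the ordering $C_0 \ge C_1 \ge C_2$, and the guarantee $L \le R_{\max} \wedge C_2$); it is not the paper's proof, which is not reproduced here.

```latex
\begin{align*}
% The capped guarantee is equivalent to two separate upper bounds on L.
L \le R_{\max} \wedge C_2
  &\iff \bigl( L \le R_{\max} \ \text{and}\ L \le C_2 \bigr), \\
% x \mapsto \min(R_{\max}, x) is non-decreasing, so the ordering of the
% information terms carries over to the capped terms.
C_2 \le C_1 \le C_0
  &\implies R_{\max} \wedge C_2 \;\le\; R_{\max} \wedge C_1 \;\le\; R_{\max} \wedge C_0 .
\end{align*}
```

Read this way, finer information can only tighten the capped certificate, and no information term can push the bound above the return-range ceiling $R_{\max}$.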
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Jian_Li14
Submission Number: 9728