Corruption-robust Offline Multi-agent Reinforcement Learning from Human Feedback

Andi Nika; Debmalya Mandal; Parameswaran Kamalaruban; Adish Singla; Goran Radanovic

Published: 03 Feb 2026, Last Modified: 03 Feb 2026. Venue: AISTATS 2026 Poster. License: CC BY 4.0

TL;DR: This paper considers the problem of robustness against data corruption in multi-agent reinforcement learning from human feedback.

Abstract: We consider robustness against data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF) under a strong‐contamination model: given a dataset $D$ of trajectory–preference tuples (each preference being an $n$-dimensional binary label vector representing each of the $n$ agents’ preferences), an $\epsilon$-fraction of the samples may be arbitrarily corrupted. We model the problem using the framework of linear Markov games. First, under a _uniform coverage_ assumption—where every policy of interest is sufficiently represented in $D$—we introduce a robust estimator that guarantees an $O(\epsilon^{1-o(1)})$ bound on the Nash‐equilibrium gap. Next, we move to the more challenging _unilateral coverage_ setting, in which only a Nash equilibrium and its single‐player deviations are covered: here our proposed algorithm achieves an $O(\sqrt{\epsilon})$ Nash‐gap bound. Both of these procedures, however, suffer from intractable computation. To address this, we relax our solution concept to _coarse correlated equilibria_ (CCE). Under the same unilateral‐coverage regime, we then derive a quasi-polynomial‐time algorithm whose CCE gap scales as $O(\sqrt{\epsilon})$. To the best of our knowledge, this is the first systematic treatment of adversarial data corruption in offline MARLHF.
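As an illustration of the strong-contamination data model described in the abstract, the following minimal Python sketch shows an adversary that inspects the clean dataset and replaces at most an $\epsilon$-fraction of its samples. The names `Sample`, `contaminate`, and `flip_first` are hypothetical. Storing two trajectories per sample and encoding trajectories as integer tuples are assumptions made for the sketch; none of this is taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Sample:
    """One trajectory-preference tuple (hypothetical layout, not the paper's)."""
    traj_a: Tuple[int, ...]   # placeholder encoding of the first trajectory
    traj_b: Tuple[int, ...]   # placeholder encoding of the second trajectory
    prefs: Tuple[int, ...]    # one binary label per agent (length n)


# An adversary sees the full clean dataset and a budget, and returns
# a mapping from sample index to the arbitrary sample that replaces it.
Adversary = Callable[[List[Sample], int], Dict[int, Sample]]


def contaminate(data: List[Sample], epsilon: float, adversary: Adversary) -> List[Sample]:
    """Apply an adversary that may replace at most floor(epsilon * |D|) samples."""
    budget = math.floor(epsilon * len(data))
    replacements = adversary(data, budget)
    if len(replacements) > budget:
        raise ValueError("adversary exceeded its corruption budget")
    return [replacements.get(i, s) for i, s in enumerate(data)]


def flip_first(data: List[Sample], budget: int) -> Dict[int, Sample]:
    """Toy adversary (an assumption): flip every agent's label on the first `budget` samples."""
    return {
        i: Sample(s.traj_a, s.traj_b, tuple(1 - p for p in s.prefs))
        for i, s in enumerate(data[:budget])
    }


# Usage: 10 clean samples, n = 2 agents, epsilon = 0.2 gives a budget of 2.
clean = [Sample((i,), (i + 1,), (1, 0)) for i in range(10)]
dirty = contaminate(clean, 0.2, flip_first)
```

The toy adversary only flips labels, whereas the contamination model in the abstract allows corrupted samples to be arbitrary, so a real adversary could also alter the trajectories themselves.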

Submission Number: 797
