TL;DR: This paper studies robustness against adversarial data corruption in offline multi-agent reinforcement learning from human feedback.
Abstract: We consider robustness against data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF) under a strong‐contamination model: given a dataset $D$ of trajectory–preference tuples (each preference being an $n$-dimensional binary label vector representing each of the $n$ agents’ preferences), an $\epsilon$-fraction of the samples may be arbitrarily corrupted. We model the problem using the framework of linear Markov games. First, under a \emph{uniform coverage} assumption—where every policy of interest is sufficiently represented in the clean (prior to corruption) data—we introduce a robust estimator that guarantees an $O(\epsilon^{1-o(1)})$ bound on the Nash‐equilibrium gap. Next, we move to the more challenging \emph{unilateral coverage} setting, in which only a Nash equilibrium and its single‐player deviations are covered: here our proposed algorithm achieves an $O(\sqrt{\epsilon})$ Nash‐gap bound. Both of these procedures, however, suffer from intractable computation. To address this, we relax our solution concept to \emph{coarse correlated equilibria} (CCE). Under the same unilateral‐coverage regime, we then derive a quasi-polynomial‐time algorithm whose CCE gap scales as $O(\sqrt{\epsilon})$. To the best of our knowledge, this is the first systematic treatment of adversarial data corruption in offline MARLHF.
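To make the strong-contamination data model in the abstract concrete, the following is a minimal sketch rather than the paper's construction. It builds a dataset of samples with $n$-dimensional binary preference labels and then lets an adversary overwrite an $\epsilon$-fraction of them. Several details are assumptions made only for illustration: each sample is a pair of trajectory feature vectors, clean labels come from a Bradley-Terry model with a linear reward per agent, and the adversary flips labels and replaces features. The names `make_clean_dataset` and `corrupt` are hypothetical.

```python
import numpy as np

def make_clean_dataset(m, n_agents, dim, rng):
    """Hypothetical clean data: m trajectory pairs with per-agent binary labels."""
    feats_a = rng.normal(size=(m, dim))   # features of the first trajectory
    feats_b = rng.normal(size=(m, dim))   # features of the second trajectory
    theta = rng.normal(size=(n_agents, dim))  # assumed linear reward parameters
    # Assumed Bradley-Terry model: P(agent i prefers a) = sigmoid(theta_i . (a - b)).
    logits = (feats_a - feats_b) @ theta.T    # shape (m, n_agents)
    probs = 1.0 / (1.0 + np.exp(-logits))
    labels = (rng.random((m, n_agents)) < probs).astype(int)
    return feats_a, feats_b, labels

def corrupt(feats_a, feats_b, labels, eps, rng):
    """Overwrite at most an eps-fraction of samples (one possible adversary)."""
    m = labels.shape[0]
    k = int(np.floor(eps * m))            # corruption budget
    idx = rng.choice(m, size=k, replace=False)
    fa, fb, lab = feats_a.copy(), feats_b.copy(), labels.copy()
    lab[idx] = 1 - lab[idx]               # flip every agent's label
    fa[idx] = rng.normal(scale=10.0, size=(k, fa.shape[1]))  # replace features
    return fa, fb, lab, idx

rng = np.random.default_rng(0)
fa, fb, y = make_clean_dataset(m=200, n_agents=3, dim=4, rng=rng)
ca, cb, cy, idx = corrupt(fa, fb, y, eps=0.1, rng=rng)
print(f"corrupted {len(idx)} of {len(y)} samples")
```

A real adversary under this model may pick both the corrupted indices and their replacement values after inspecting the clean data. The random choice here only illustrates the $\epsilon$ budget.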
Code Dataset Promise: No
Signed Copyright Form: pdf
Format Confirmation: I agree that I have read and followed the formatting instructions for the camera ready version.
Submission Number: 797