Corruption-robust Offline Multi-agent Reinforcement Learning from Human Feedback
TL;DR: This paper studies robustness against adversarial data corruption in offline multi-agent reinforcement learning from human feedback.
Abstract: We consider robustness against data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF) under a strong-contamination model: given a dataset $D$ of trajectory-preference tuples (each preference being an $n$-dimensional binary label vector recording each of the $n$ agents' preferences), an $\epsilon$-fraction of the samples may be arbitrarily corrupted. We model the problem in the framework of linear Markov games. First, under a _uniform coverage_ assumption, where every policy of interest is sufficiently represented in $D$, we introduce a robust estimator that guarantees an $O(\epsilon^{1-o(1)})$ bound on the Nash-equilibrium gap. Next, we move to the more challenging _unilateral coverage_ setting, in which only a Nash equilibrium and its single-player deviations are covered; here, our proposed algorithm achieves an $O(\sqrt{\epsilon})$ bound on the Nash gap. Both procedures, however, are computationally intractable. To address this, we relax the solution concept to _coarse correlated equilibria_ (CCE). Under the same unilateral-coverage regime, we derive a quasi-polynomial-time algorithm whose CCE gap scales as $O(\sqrt{\epsilon})$. To the best of our knowledge, this is the first systematic treatment of adversarial data corruption in offline MARLHF.
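For concreteness, a standard formalization of the two quantities referenced above (the strong-contamination model and the Nash-equilibrium gap) is sketched below; the notation, with $\bar{\tau}^{(k)}, \bar{y}^{(k)}$ for the clean samples, $V_i$ for agent $i$'s value, and $\hat{\pi}$ for the learned joint policy, is illustrative and may differ from the paper's own definitions.

$$
D=\{(\tau^{(k)},y^{(k)})\}_{k=1}^{N},\qquad y^{(k)}\in\{0,1\}^{n},\qquad \big|\{k:(\tau^{(k)},y^{(k)})\neq(\bar{\tau}^{(k)},\bar{y}^{(k)})\}\big|\le\epsilon N,
$$
$$
\mathrm{NashGap}(\hat{\pi})=\max_{i\in[n]}\Big(\max_{\pi_i'}\,V_i(\pi_i',\hat{\pi}_{-i})-V_i(\hat{\pi})\Big),
$$

where $\hat{\pi}_{-i}$ denotes the policies of all agents other than $i$. A guarantee such as an $O(\sqrt{\epsilon})$ Nash gap means this quantity is bounded by $O(\sqrt{\epsilon})$ for the output policy, even though up to $\epsilon N$ of the observed tuples may have been arbitrarily replaced.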
Submission Number: 797