Defending Against Unknown Corrupted Agents: Reinforcement Learning of Adversarially Robust Nash Equilibria

Published: 16 Aug 2024, Last Modified: 16 Aug 2024. Accepted by TMLR. License: CC BY-SA 4.0
Abstract: We consider a Multi-agent Reinforcement Learning (MARL) setting, in which an attacker can arbitrarily corrupt any subset of up to $k$ out of $n$ agents at deployment. Our goal is to design agents that are robust against such an attack, by accounting for the presence of corrupted agents at test time. To that end, we introduce a novel solution concept, the Adversarially Robust Nash Equilibrium (ARNEQ), and provide theoretical proof of its existence in general-sum Markov games. Furthermore, we introduce a proof-of-concept model-based approach to computing it and theoretically prove its convergence under standard assumptions. We also present a practical approach called Adversarially Robust Training (ART), an independent learning algorithm based on stochastic gradient descent ascent. Our experiments in both cooperative and mixed cooperative-competitive environments demonstrate ART's effectiveness and practical value in enhancing MARL resilience against adversarial behavior.
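The abstract describes ART only at a high level, as independent learning via stochastic gradient descent ascent against up to $k$ corrupted agents. Purely as an illustration of that descent-ascent structure, and not of the authors' implementation, the sketch below alternates an ascent step for the benign agents' parameters with a descent step for the policies of a randomly sampled corrupted subset on a toy one-state game; all names, the toy payoff, and the PyTorch parameterization are assumptions.

```python
# Minimal sketch, assuming a toy one-state game and a PyTorch parameterization.
# None of the names or the payoff below come from the paper; they only illustrate
# the alternating stochastic gradient descent-ascent structure that ART is said
# to build on: benign agents ascend on their expected return while adversarial
# policies for a randomly corrupted subset of up to k agents descend on it.
import torch

n, k, A = 4, 1, 2                                # agents, corruption budget, actions per agent
torch.manual_seed(0)

theta = torch.zeros(n, A, requires_grad=True)    # benign agents' policy logits
phi = torch.zeros(n, A, requires_grad=True)      # adversary's logits for corrupted agents

def benign_return(corrupted):
    """Toy differentiable stand-in for the benign agents' expected return when the
    agents in `corrupted` follow the adversary's policy instead of their own."""
    is_corrupted = torch.tensor([i in corrupted for i in range(n)])
    probs = torch.where(is_corrupted.unsqueeze(1),
                        torch.softmax(phi, dim=1),     # corrupted agents play the adversary's mix
                        torch.softmax(theta, dim=1))   # everyone else plays the benign mix
    p0 = probs[:, 0]                                   # probability of action 0 per agent
    benign_mass = (p0 * (~is_corrupted).float()).sum()
    corrupted_mass = (p0 * is_corrupted.float()).sum()
    # Reward benign coordination on action 0, penalize overlap with corrupted agents.
    return benign_mass ** 2 - 2.0 * benign_mass * corrupted_mass

opt_max = torch.optim.SGD([theta], lr=0.1)   # ascent player (benign agents)
opt_min = torch.optim.SGD([phi], lr=0.1)     # descent player (adversary)

for step in range(2000):
    corrupted = set(torch.randperm(n)[:k].tolist())    # sample which agents are corrupted
    opt_max.zero_grad()
    (-benign_return(corrupted)).backward()             # gradient ascent on the benign return
    opt_max.step()
    opt_min.zero_grad()
    benign_return(corrupted).backward()                # gradient descent by the adversary
    opt_min.step()
```

In an actual MARL environment the toy return would be replaced by a policy-gradient surrogate estimated from rollouts, and each agent would update its own parameters independently, but the alternating max/min updates would keep this shape.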
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: All changes in the submission are highlighted in blue.

On page 3, we have added the sentence "As a consequence, the problem is no longer a standard Markov game, and thus there is no NE defined in its solution space." and used the word "inappropriate".

On page 4, we have added the sentence "Given action $a$, policy $\pi$, and a set $K\subseteq \mathcal{N}$, we define $a_K=(a_j)_{j\in K}$ and $a_{-K}=(a_j)_{j\not\in K}$ for action $a$, and $\pi_K=(\pi_j)_{j\in K}$ and $\pi_{-K}=(\pi_j)_{j\not\in K}$ for policy $\pi$."

On page 6, we have modified the definition of $\overline{V}^{t+1}_i(s)$ to
$$\overline{V}^{t+1}_i(s) = \max_{\pi_i}\min_{\substack{\omega_i\in \Omega_i \\ \widehat{\pi}^i \in \Pi_{-i}}} \mathbb{E}_{K\sim\omega_i}\Bigg[ \sum_{a\in\mathcal{A}}\pi_i(a_i|s)\,\overline{\pi}^t_{-(K\cup\{i\})}(a_{-(K\cup\{i\})}|s)\,\widehat{\pi}^i_K(a_K|s)\,(\mathcal{B}\overline{V}^t_i)(s,a) \Bigg].$$

On page 7, we have added Assumption 3 and the proof sketch of Theorem 2: "The main ingredient of the proof is a previous result from Hu & Wellman (2003), which states that, if Assumptions 1 and 2 hold and a given operator on the Q-functions is a contraction, then the procedure described above converges to an equilibrium. Hence, the only thing left to prove is that the ARNEQ operator is a contraction. We use Assumption 3 to that end, by separately considering both cases of the assumption. With this, all the conditions of the utilized result are satisfied, and we conclude convergence to an ARNEQ."

On page 7, we have also added the following paragraph: "Although the proposed procedure is simple and intuitive, with the crucial benefit of satisfying strong theoretical guarantees, it also has several drawbacks. First, note that the update rule given in Equation (2) requires knowledge of the equilibrium policies of the benign agents for the stage games in every iteration, which in turn requires knowledge of the Q-values of all agents from every agent's point of view. This is a downside shared by all centralized, value-based algorithms in MARL, such as Nash Q-Learning. Second, even if knowledge of the Q-values of all agents can be guaranteed, computing a Nash equilibrium from given utilities in a general-sum Markov game is known to be computationally hard (Daskalakis et al., 2009). Finally, note that the theoretical guarantees of the proposed method depend on the stated assumptions, which may not always be satisfied in practice, where the irregularities in the individual utilities need not satisfy saddle-point or global-optimum conditions. Motivated by the above, our next goal is to find a more practical and efficient approach to finding ARNEQ policies. In the next section, we introduce a model-free, gradient-based algorithm that empirically provides an efficient defense in various MARL environments."

On page 17, we have added the paragraph "This result has been classically used in proving the existence of Nash equilibria in various types of Markov games. It requires the careful construction of a set-valued function whose fixed point would represent the equilibrium of the game of interest. Once the construction is made, the theorem states that, if such a function satisfies some technical conditions, then the existence of its fixed point is guaranteed, thus effectively proving the existence of an equilibrium of the game. This will be our approach in the following. First, we will prove some auxiliary results related to properties such as contraction, continuity, and convexity. Then, we will construct a set-valued function whose fixed point would represent an ARNEQ and further show that it satisfies the technical conditions of Kakutani's fixed point theorem." We have also added the sentence "In this section we prove some auxiliary results which will be needed in the proof of Theorem 1."

On page 21, we have added the paragraphs "In this section, we will conclude the proof of Theorem 1. In order to do that, we will make use of Kakutani's fixed point theorem, which we state below. First, let us define the notion of upper semi-continuous functions, which is a precondition of this result." and "In the previous section, we have shown that the functions $\phi^i_s$ satisfy continuity. We will further show that the defined set-valued function $\kappa$ satisfies the conditions of Kakutani's theorem. Let us first restate Theorem 1 for convenience, and then proceed to its proof."

On page 23, we have added the sentences "In order to prove the lemma, we will make use of the following standard result on pseudo-contractions." and "Now we are ready to prove Lemma 8. In order to do so, we need to show that its conditions are satisfied for our setting." We have further slightly changed the order of two results.
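For readers parsing the modified page-6 definition above, the following tiny, assumed example may help: it brute-forces the max-min structure for a single state with fixed stage-game values $(\mathcal{B}\overline{V}^t_i)(s,a)$, restricting agent $i$ and the attacker to pure strategies for readability (the actual definition optimizes over mixed policies $\pi_i$, distributions $\omega_i$ over corrupted subsets, and adversarial joint policies $\widehat{\pi}^i_K$). The utility function, sizes, and variable names are illustrative assumptions, not the paper's.

```python
# Rough, assumed illustration of the max-min structure in the page-6 definition:
# for one state with fixed stage-game values (B V_i)(s, a), agent i maximizes over
# its own action while the attacker minimizes over which subset K of the remaining
# agents it corrupts and over the corrupted agents' joint action. Pure strategies
# only, by brute force, so this is readable rather than faithful. Sizes, names,
# and the utility q_i are illustrative assumptions.
from itertools import combinations, product

n, k, A = 3, 1, 2          # agents, corruption budget, actions per agent
i = 0                      # agent whose robust value we evaluate
others = [j for j in range(n) if j != i]

def q_i(joint_action):
    """Assumed stage-game utility for agent i at the current state."""
    return sum(1.0 if a == 0 else -0.5 for a in joint_action)

benign_action = {j: 0 for j in others}   # assumed current play of the other benign agents

def robust_value():
    best = float("-inf")
    for a_i in range(A):                                   # max over agent i's action
        worst = float("inf")
        for K in combinations(others, k):                  # min over corrupted subsets of size k
            for adv in product(range(A), repeat=len(K)):   # min over the corrupted joint action
                joint = [None] * n
                joint[i] = a_i
                for j in others:
                    joint[j] = benign_action[j]
                for j, a in zip(K, adv):
                    joint[j] = a                           # corrupted agents override benign play
                worst = min(worst, q_i(tuple(joint)))
        best = max(best, worst)
    return best

print(robust_value())   # worst-case value of the best pure action for agent i
```

Even at this scale, enumerating every corrupted subset and its joint actions makes the combinatorial cost of the exact computation visible, which is consistent with the drawbacks paragraph above motivating the practical gradient-based alternative.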
Supplementary Material: zip
Assigned Action Editor: ~Thomy_Phan1
Submission Number: 2604