Defense Against Reward Poisoning Attacks in Reinforcement Learning
Abstract: We study defense strategies against reward poisoning attacks in reinforcement learning. As a threat model, we consider cost-effective targeted attacks---these attacks minimally alter rewards to make the attacker's target policy uniquely optimal under the poisoned rewards, with the optimality gap specified by an attack parameter. Our goal is to design agents that are robust against such attacks in terms of the worst-case utility w.r.t. the true, unpoisoned, rewards while computing their policies under the poisoned rewards. We propose an optimization framework for deriving optimal defense policies, both when the attack parameter is known and when it is unknown. For this optimization framework, we first provide characterization results for generic attack cost functions. These results show that the functional form of the attack cost function and the agent's knowledge about it are critical for establishing lower bounds on the agent's performance, as well as for the computational tractability of the defense problem. We then focus on a cost function based on the $\ell_2$ norm, for which we show that the defense problem can be efficiently solved and yields defense policies whose expected returns under the true rewards are lower bounded by their expected returns under the poisoned rewards. Using simulation-based experiments, we demonstrate the effectiveness and robustness of our defense approach.
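To make the threat model concrete, here is a minimal sketch of the setting the abstract describes: an attacker perturbs the rewards of a toy MDP so that its target policy looks attractive under the poisoned rewards, while the agent's utility is measured under the true rewards. All numbers (the MDP, the perturbation, the target policy) are hypothetical and chosen only for illustration; this is not the paper's attack or defense algorithm.

```python
import numpy as np

# Toy 2-state, 2-action MDP (all quantities hypothetical, for illustration).
# P[a, s, s'] = transition probability; R_true[s, a] = true reward.
gamma = 0.9
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],  # transitions under action 0
    [[0.1, 0.9], [0.8, 0.2]],  # transitions under action 1
])
R_true = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

# A targeted poisoning: the attacker nudges rewards so that its target
# policy (always play action 1) looks good under the poisoned rewards.
R_poisoned = R_true + np.array([[-0.5, 0.5],
                                [0.5, -0.5]])  # hypothetical perturbation

def policy_return(policy, R, start=0):
    """Expected discounted return of a deterministic policy under rewards R,
    computed by solving the policy-evaluation linear system V = r + gamma*P*V."""
    Ppi = np.array([P[policy[s], s] for s in range(2)])
    rpi = np.array([R[s, policy[s]] for s in range(2)])
    V = np.linalg.solve(np.eye(2) - gamma * Ppi, rpi)
    return V[start]

target_policy = [1, 1]  # the attacker's target
print(policy_return(target_policy, R_poisoned))  # return the agent sees
print(policy_return(target_policy, R_true))      # true, unpoisoned return
```

In this toy instance the target policy's return under the poisoned rewards exceeds its return under the true rewards, which is exactly the gap a defense policy tries to control in the worst case.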
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: **Changes for the camera-ready submission**:
- Added author names, affiliations, and acknowledgements.
- Expanded the proofs in Appendix H to improve clarity.

**Changes in the review process**: We thank the reviewers for their valuable comments and suggestions. We have updated the paper accordingly and uploaded its revised version. Some of the most important changes include:
- We have added a remark in Section 3.2 (Remark 3.1) that explains the ergodicity assumption.
- We have added a discussion of the problem formulation in Section 3.3 (below problem (P2a)) and a figure (Figure 2 in the revised version) that illustrates the problem setting.
- We have provided additional explanations of our theoretical results at the beginning of Section 5.1, and have moved one figure from the appendix to the main part of the paper; this figure (Figure 3 in the revised version) illustrates the attack and defense strategies, further supporting our explanations.
- We have added discussions in Section 7 (paragraphs *Beyond the worst-case utility* and *Unknown-model and scalability*) on extensions of our results (e.g., continuous state spaces, the suboptimality gap).
- We have added a discussion of the computational complexity of the approach in Section 5.1 (below Theorem 5.1).
- We have added a discussion on how to compute occupancy measures in Section 5.1 (below Theorem 5.1).
- We have added steps and clarifications to some of our proofs, following the reviewers' suggestions (in particular, the proofs of Theorem 4.3 and Theorem 4.4).
- We have added clarifications about implementation details in Section 5.1 (below Theorem 5.1), Section 6 (at the beginning of the section), and Appendix C.1.
- We have added an experiment on the worst-case score of our policy in Appendix J.
- We have edited the discussion in Section 7 to better describe the limitations of our work.

In our response to the reviewers' comments, we reference the specific parts of the paper that were modified.
Assigned Action Editor: ~Olivier_Pietquin1
Submission Number: 453