Provable Defense against Backdoor Policies in Reinforcement Learning

Shubham Kumar Bharti; Xuezhou Zhang; Adish Singla; Jerry Zhu

Provable Defense against Backdoor Policies in Reinforcement Learning

Shubham Kumar Bharti, Xuezhou Zhang, Adish Singla, Jerry Zhu

Published: 31 Oct 2022, Last Modified: 06 Apr 2025NeurIPS 2022 AcceptReaders: Everyone

Keywords: Adversarial Learning, Reinforcement Learning

TL;DR: We propose a provable defense mechanism against backdoor policies in reinforcement learning.

Abstract: We propose a provable defense mechanism against backdoor policies in reinforcement learning under subspace trigger assumption. A backdoor policy is a security threat where an adversary publishes a seemingly well-behaved policy which in fact allows hidden triggers. During deployment, the adversary can modify observed states in a particular way to trigger unexpected actions and harm the agent. We assume the agent does not have the resources to re-train a good policy. Instead, our defense mechanism sanitizes the backdoor policy by projecting observed states to a `safe subspace', estimated from a small number of interactions with a clean (non-triggered) environment. Our sanitized policy achieves $\epsilon$ approximate optimality in the presence of triggers, provided the number of clean interactions is $O\left(\frac{D}{(1-\gamma)^4 \epsilon^2}\right)$ where $\gamma$ is the discounting factor and $D$ is the dimension of state space. Empirically, we show that our sanitization defense performs well on two Atari game environments.

Supplementary Material: zip

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/provable-defense-against-backdoor-policies-in/code)

13 Replies

Loading