Stealthy Backdoor Attack in Reinforcement Learning via Bi-level Optimization

ICLR 2026 Conference Submission 21018 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Backdoor Attack, Reinforcement Learning, Bi-level Optimization
TL;DR: To highlight the security issues of current deep reinforcement learning algorithms, we propose a novel backdoor attack algorithm that exhibits strong stealthiness and causes a significant performance drop when the trigger is activated.
Abstract: Reinforcement learning (RL) has achieved remarkable success across diverse domains, enabling autonomous systems to learn and adapt to dynamic environments. However, the security and reliability of RL models remain significant concerns, especially given the growing threat of backdoor attacks. In this paper, we formalize backdoor attacks in RL as an optimization problem, offering a principled framework for analyzing and designing such attacks. Our approach uniquely emphasizes stealthiness by minimizing data distortions during RL training, and we propose a single-loop iterative algorithm based on a penalty-based bi-level reformulation to solve the optimization problem. The stealthiness and effectiveness of the backdoor are ensured through inequality constraints on $Q$-values, which prioritize malicious actions, and equality constraints that reflect the Bellman optimality conditions. We evaluate our stealthy backdoor attack across both classic control and MuJoCo environments. In particular, in the Hopper and Walker2D environments, the backdoored agent exhibits strong stealthiness, with minimal performance drops of only 2.18% and 4.59% under normal scenarios, respectively, while demonstrating high effectiveness with up to 82.31% and 71.27% declines under triggered scenarios.
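The constraint structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm, only a toy tabular illustration of the two penalty terms: the Bellman optimality condition (equality constraint) enforced as a squared-residual penalty on clean transitions, and the attack objective (inequality constraint) enforced as a hinge penalty pushing the malicious action's $Q$-value above all others in the triggered state. All names and hyperparameters (`gamma`, `lr`, `lam_eq`, `lam_ineq`) are illustrative assumptions.

```python
import numpy as np

def penalty_backdoor_update(Q, transitions, trigger_state, malicious_action,
                            gamma=0.99, lr=0.1, lam_eq=1.0, lam_ineq=1.0):
    """One step on a penalized objective (illustrative, not the paper's method):
    - equality penalty: squared Bellman optimality residual on clean transitions
    - inequality penalty: hinge pushing Q(trigger, malicious) above other actions
    """
    Q = Q.copy()
    # Equality constraint as a penalty: gradient step on the Bellman residual
    # (Q(s,a) - [r + gamma * max_a' Q(s',a')])^2 for each clean transition.
    for (s, a, r, s_next) in transitions:
        target = r + gamma * Q[s_next].max()
        Q[s, a] -= lr * lam_eq * (Q[s, a] - target)
    # Inequality constraint as a hinge penalty: subgradient step on
    # max(0, Q(trigger, a) - Q(trigger, malicious)) for each competing action,
    # which prioritizes the malicious action only in the triggered state.
    for a in range(Q.shape[1]):
        if a == malicious_action:
            continue
        margin = Q[trigger_state, a] - Q[trigger_state, malicious_action]
        if margin > 0:  # constraint violated: push the pair apart
            Q[trigger_state, a] -= lr * lam_ineq
            Q[trigger_state, malicious_action] += lr * lam_ineq
    return Q
```

Because the hinge term only fires when the constraint is violated, clean-state $Q$-values are left to the ordinary Bellman update, which mirrors the stealthiness goal: behavior is distorted only where the trigger is present.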
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21018