Exposing Vulnerabilities in RL: A Novel Stealthy Backdoor Attack through Reward Poisoning

Published: 22 Nov 2025 (Last Modified: 22 Nov 2025) · SAFE-ROL Poster · CC BY 4.0
Keywords: Reward poisoning, Reinforcement learning, Robot learning
TL;DR: To highlight the security issues of current deep reinforcement learning algorithms, we propose a novel backdoor attack that exhibits strong stealthiness and causes a significant performance drop when the trigger is activated.
Abstract: Reinforcement learning (RL) has achieved remarkable success across diverse domains, enabling autonomous systems to learn and adapt to dynamic environments by optimizing a reward function. However, this reliance on reward signals creates a significant security vulnerability. In this paper, we study a novel stealthy backdoor attack that manipulates an agent's policy by poisoning its reward signals. The attack's effectiveness demonstrates a critical threat to the integrity of deployed RL systems, calling for the community's urgent attention to develop robust defenses against such training-time manipulations. We evaluate the stealthy backdoor attack across both classic control and MuJoCo environments. In particular, the backdoored agent exhibits strong stealthiness in the \textit{Hopper} and \textit{Walker2D} environments, with minimal performance drops of only $2.18\%$ and $4.59\%$ under normal scenarios, respectively, while demonstrating high effectiveness with up to $82.31\%$ and $71.27\%$ performance declines under triggered scenarios.
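To make the threat model concrete, the sketch below illustrates trigger-conditioned reward poisoning in a generic, Gym-style interface. The paper's actual algorithm is not given on this page; the names (`PoisonedRewardWrapper`, `trigger_fn`, `poison_scale`) and the toy environment are illustrative assumptions, not the authors' method.

```python
class ToyEnv:
    """Minimal stand-in environment: state is a single float, reward = state.
    (Hypothetical; any Gym-style env with a step() method would fit here.)"""
    def __init__(self):
        self.state = 0.0

    def step(self, action):
        self.state += action
        reward = self.state
        return self.state, reward


class PoisonedRewardWrapper:
    """Attacker-controlled wrapper: leaves clean transitions untouched (so the
    backdoor stays stealthy during normal training) but corrupts the reward
    whenever an attacker-chosen trigger predicate fires on the observation."""
    def __init__(self, env, trigger_fn, poison_scale=-1.0):
        self.env = env
        self.trigger_fn = trigger_fn      # assumed trigger predicate
        self.poison_scale = poison_scale  # how the reward is corrupted

    def step(self, action):
        obs, reward = self.env.step(action)
        if self.trigger_fn(obs):
            reward = self.poison_scale * reward  # poisoned signal
        return obs, reward


# Usage: trigger fires when the observation exceeds 2.0.
env = PoisonedRewardWrapper(ToyEnv(), trigger_fn=lambda obs: obs > 2.0)
_, r_clean = env.step(1.0)      # obs = 1.0, no trigger: reward unchanged
_, r_poisoned = env.step(2.0)   # obs = 3.0, trigger fires: reward flipped
```

An agent trained against such a wrapper sees near-normal rewards in trigger-free episodes, which is consistent with the small clean-performance drops reported above, while its learned policy degrades sharply once the trigger appears at deployment.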
Submission Number: 18