Exposing Vulnerabilities in RL: A Novel Stealthy Backdoor Attack through Reward Poisoning

Published: 22 Nov 2025 (Last Modified: 22 Nov 2025) · SAFE-ROL Poster · CC BY 4.0
Keywords: Reward poisoning, Reinforcement learning, Robot learning
TL;DR: To highlight the security issues of current deep reinforcement learning algorithms, we propose a novel backdoor attack that exhibits strong stealthiness and causes a significant performance drop when the trigger is activated.
Abstract: Reinforcement learning (RL) has achieved remarkable success across diverse domains, enabling autonomous systems to learn and adapt to dynamic environments by optimizing a reward function. However, this reliance on reward signals creates a significant security vulnerability. In this paper, we study a novel stealthy backdoor attack that manipulates an agent's policy by poisoning its reward signals. The attack's effectiveness demonstrates a critical threat to the integrity of deployed RL systems, calling for the community's urgent attention to develop robust defenses against such training-time manipulations. We evaluate the stealthy backdoor attack across both classic control and MuJoCo environments. In particular, the backdoored agent exhibits strong stealthiness in the \textit{Hopper} and \textit{Walker2D} environments, with minimal performance drops of only $2.18\%$ and $4.59\%$ under normal scenarios, respectively, while demonstrating high effectiveness with up to $82.31\%$ and $71.27\%$ performance declines under triggered scenarios.
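To make the threat model concrete, the sketch below illustrates trigger-conditioned reward poisoning in a generic, Gym-style interface. The paper's actual algorithm is not given on this page; the names (`PoisonedRewardWrapper`, `trigger_fn`, `poison_scale`) and the toy environment are illustrative assumptions, not the authors' method.

```python
class ToyEnv:
    """Minimal stand-in environment: state is a single float, reward = state.
    (Hypothetical; any Gym-style env with a step() method would fit here.)"""
    def __init__(self):
        self.state = 0.0

    def step(self, action):
        self.state += action
        reward = self.state
        return self.state, reward


class PoisonedRewardWrapper:
    """Attacker-controlled wrapper: leaves clean transitions untouched (so the
    backdoor stays stealthy during normal training) but corrupts the reward
    whenever an attacker-chosen trigger predicate fires on the observation."""
    def __init__(self, env, trigger_fn, poison_scale=-1.0):
        self.env = env
        self.trigger_fn = trigger_fn      # assumed trigger predicate
        self.poison_scale = poison_scale  # how the reward is corrupted

    def step(self, action):
        obs, reward = self.env.step(action)
        if self.trigger_fn(obs):
            reward = self.poison_scale * reward  # poisoned signal
        return obs, reward


# Usage: trigger fires when the observation exceeds 2.0.
env = PoisonedRewardWrapper(ToyEnv(), trigger_fn=lambda obs: obs > 2.0)
_, r_clean = env.step(1.0)      # obs = 1.0, no trigger: reward unchanged
_, r_poisoned = env.step(2.0)   # obs = 3.0, trigger fires: reward flipped
```

An agent trained against such a wrapper sees near-normal rewards in trigger-free episodes, which is consistent with the small clean-performance drops reported above, while its learned policy degrades sharply once the trigger appears at deployment.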
Submission Number: 18