Targeted Poisoning of Reinforcement Learning Agents

Published: 17 Jul 2025, Last Modified: 06 Sept 2025, EWRL 2025 Poster, CC BY 4.0
Keywords: Reinforcement learning, Trustworthy machine learning, Adversarial attacks
Abstract: Understanding the impact of adversarial attacks on reinforcement learning (RL) models is essential due to their wide range of applications. In this work, we initiate a study of targeted poisoning attacks on reinforcement learning agents, where the adversary aims to deliberately increase the likelihood of a specific undesirable event chosen by the attacker. In particular, rather than degrading overall performance indiscriminately, the adversary carefully manipulates the training process so that, during critical decision-making steps, the agent is more likely to fail in a targeted manner, leading it into the adversary's desired outcome. We present theoretical results showing the effectiveness of such targeted poisoning in basic RL settings. Building on these insights, we design practical attack strategies and thoroughly evaluate their impact beyond the scope of our theoretical analysis. Through extensive experiments, we demonstrate that targeted poisoning attacks substantially raise the probability of the chosen undesirable event across a variety of reinforcement learning tasks, ranging from classic control benchmarks to more complex continuous-control environments, including stochastic settings. We compare our attacks against standard RL baselines and against algorithms specifically designed to mitigate poisoning, and we further validate their effectiveness on deep RL models. Our results highlight the vulnerabilities of RL systems to targeted training-time manipulations, underscoring the need for stronger defenses.
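The abstract does not specify the attack mechanics, so the following is only an illustrative sketch of the general idea of targeted training-time poisoning, not the authors' algorithm: a tabular Q-learning agent on a toy chain MDP whose observed rewards are perturbed whenever it moves toward an attacker-chosen "failure" state. The environment, the `poison` function, and all parameters (e.g., `eps`) are assumptions made for this illustration.

```python
import random

# Tiny deterministic chain MDP: states 0..N-1, start in the middle.
# Reaching state N-1 yields reward +1 (intended goal); reaching state 0
# ends the episode with reward 0 and is the attacker's chosen "failure" event.
N = 7
START, FAIL, GOAL = N // 2, 0, N - 1
ACTIONS = (-1, +1)  # move left / move right

def step(state, action):
    nxt = max(0, min(N - 1, state + ACTIONS[action]))
    reward = 1.0 if nxt == GOAL else 0.0
    done = nxt in (FAIL, GOAL)
    return nxt, reward, done

def poison(action, reward, eps=0.5):
    # Hypothetical targeted reward poisoning: inflate the reward the learner
    # observes whenever it moves toward the attacker's target state.
    return reward + eps if ACTIONS[action] == -1 else reward

def q_learning(episodes=2000, alpha=0.1, gamma=0.95, explore=0.1, attack=False):
    Q = [[0.0, 0.0] for _ in range(N)]
    for _ in range(episodes):
        s, done = START, False
        while not done:
            a = random.randrange(2) if random.random() < explore \
                else max((0, 1), key=lambda x: Q[s][x])
            s2, r, done = step(s, a)
            if attack:
                r = poison(a, r)  # attacker corrupts only the training signal
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

def failure_rate(Q, trials=1000):
    fails = 0
    for _ in range(trials):
        s, done = START, False
        for _ in range(10 * N):  # step cap guards against greedy-policy cycles
            if done:
                break
            s, _, done = step(s, max((0, 1), key=lambda x: Q[s][x]))
        fails += (s == FAIL)
    return fails / trials

random.seed(0)
print("clean    failure rate:", failure_rate(q_learning(attack=False)))
print("poisoned failure rate:", failure_rate(q_learning(attack=True)))
```

Under this toy setup the clean agent learns to head for the goal, while the poisoned agent's greedy policy is steered into the attacker's failure state, which is the qualitative effect the abstract describes (raising the probability of a chosen undesirable event rather than degrading performance indiscriminately).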
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Omran_Shahbazi_Gholiabad1
Track: Regular Track: unpublished work
Submission Number: 121