Keywords: Reinforcement learning, Trustworthy machine learning, Adversarial attacks
Abstract: Understanding the impact of adversarial attacks on reinforcement learning (RL)
models is essential given the wide range of applications in which these models are deployed. In this work, we initiate
a study of targeted poisoning attacks against reinforcement learning agents, in which the
adversary aims to deliberately increase the likelihood of a specific undesirable event
chosen by the attacker. In particular, rather than degrading overall performance
indiscriminately, the adversary carefully manipulates the training process so that,
during critical decision-making steps, the agent is more likely to fail in a targeted
manner, steering it toward the adversary's desired outcome.
We present theoretical results showing the effectiveness of such targeted poisoning in basic RL settings. Building on these insights, we design practical attack strategies and thoroughly evaluate their impact beyond the scope of our theoretical analysis. Through extensive experiments, we demonstrate that targeted poisoning attacks substantially raise the probability of the chosen undesirable event across a variety of reinforcement learning tasks, ranging from classic control benchmarks to more complex continuous-control environments, including stochastic settings. We compare our attacks against standard RL baselines and against algorithms specifically designed to mitigate poisoning, and we further validate their effectiveness on deep RL models. Our results highlight the vulnerabilities of RL systems to targeted training-time manipulations, underscoring the need for stronger defenses.
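As a rough intuition for the threat model described above (and not the attack, environments, or algorithms used in the paper), the following sketch shows where a targeted training-time manipulation could enter an otherwise standard learning loop: tabular Q-learning on a toy chain MDP, where the attacker perturbs only the rewards seen during training so that the greedy policy learned afterwards terminates in an attacker-chosen state. All names and constants (`poison_reward`, `TARGET_STATE`, the chain, the bonus size) are hypothetical illustrations.

```python
# Minimal, hypothetical sketch (not the paper's attack or environments):
# tabular Q-learning on a toy chain MDP, with an optional training-time
# reward manipulation that makes transitions into an attacker-chosen
# terminal state look rewarding. All names and constants are illustrative.
import numpy as np

N_STATES = 6        # chain states 0..5; state 5 is the intended goal
TARGET_STATE = 0    # attacker-chosen undesirable terminal state
ACTIONS = (-1, +1)  # move left / move right


def step(state, action):
    """Deterministic chain dynamics; episodes end at either end of the chain."""
    next_state = int(np.clip(state + action, 0, N_STATES - 1))
    done = next_state in (0, N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, done


def poison_reward(reward, next_state, bonus=1.5):
    """Hypothetical targeted poisoning: during training, entering TARGET_STATE
    is made to look rewarding, so the learned policy drifts toward that
    failure at the critical left/right decisions near the start state."""
    return reward + bonus if next_state == TARGET_STATE else reward


def train(poisoned, episodes=5000, alpha=0.1, gamma=0.95, eps=0.3, seed=0):
    """Standard epsilon-greedy Q-learning; the only difference between the
    clean and poisoned runs is the reward observed during training."""
    rng = np.random.default_rng(seed)
    q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(episodes):
        s, done = N_STATES // 2, False
        while not done:
            a = int(rng.integers(len(ACTIONS))) if rng.random() < eps else int(q[s].argmax())
            s2, r, done = step(s, ACTIONS[a])
            if poisoned:
                r = poison_reward(r, s2)  # training-time manipulation only
            q[s, a] += alpha * (r + gamma * (0.0 if done else q[s2].max()) - q[s, a])
            s = s2
    return q


def greedy_rollout(q, max_steps=50):
    """Deploy the learned greedy policy (with a step cap as a safeguard)."""
    s, done = N_STATES // 2, False
    for _ in range(max_steps):
        s, _, done = step(s, ACTIONS[int(q[s].argmax())])
        if done:
            break
    return s


if __name__ == "__main__":
    for poisoned in (False, True):
        end = greedy_rollout(train(poisoned))
        label = "poisoned" if poisoned else "clean"
        print(f"{label:8s} training -> greedy policy ends in state {end}; "
              f"target event triggered: {end == TARGET_STATE}")
```

Under these toy assumptions, the clean run learns to reach the intended goal state, while the poisoned run, trained on perturbed rewards at test time indistinguishable from a normal learner, is steered into TARGET_STATE.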
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Omran_Shahbazi_Gholiabad1
Track: Regular Track: unpublished work
Submission Number: 121