Adversarial Inception Backdoor Attacks against Reinforcement Learning

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: First class of backdoor attacks against DRL with theoretical guarantees of attack success under natural reward constraints
Abstract: Recent works have demonstrated the vulnerability of Deep Reinforcement Learning (DRL) algorithms to training-time backdoor poisoning attacks. The objectives of these attacks are twofold: induce pre-determined, adversarial behavior in the agent upon observing a fixed trigger during deployment, while allowing the agent to solve its intended task during training. Prior attacks assume arbitrary control over the agent's rewards, inducing values far outside the environment's natural constraints. This results in brittle attacks that fail once the proper reward constraints are enforced. Thus, in this work we propose a new class of backdoor attacks against DRL which are the first to achieve state-of-the-art performance under strict reward constraints. These "inception" attacks manipulate the agent's training data -- inserting the trigger into prior observations and replacing high-return actions with those of the targeted adversarial behavior. We formally define these attacks and prove they achieve both adversarial objectives against arbitrary Markov Decision Processes (MDPs). Using this framework we devise an online inception attack which achieves a 100% attack success rate on multiple environments under constrained rewards while minimally impacting the agent's task performance.
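To illustrate the mechanism described in the abstract, the following is a minimal sketch (not the authors' implementation; see the linked repository for that) of how a single stored transition could be poisoned in inception style: the trigger is stamped into the observation and the stored action is replaced with the adversary's target action, while the reward is left inside the environment's natural range rather than being set to out-of-bounds values. The names `apply_trigger`, `poison_transition`, `TRIGGER_PATCH`, and `TARGET_ACTION` are illustrative assumptions.

```python
# Hypothetical sketch of an inception-style poisoning step, assuming
# image observations and a discrete action space.
import numpy as np

TRIGGER_PATCH = 255 * np.ones((4, 4, 3), dtype=np.uint8)  # e.g. a white patch
TARGET_ACTION = 2  # adversary's desired action (illustrative index)

def apply_trigger(obs: np.ndarray) -> np.ndarray:
    """Stamp the fixed trigger into the corner of an observation."""
    poisoned = obs.copy()
    h, w, _ = TRIGGER_PATCH.shape
    poisoned[:h, :w, :] = TRIGGER_PATCH
    return poisoned

def poison_transition(obs, action, reward, next_obs, reward_bounds=(-1.0, 1.0)):
    """Poison one (s, a, r, s') tuple without leaving the environment's
    natural reward range: insert the trigger and swap in the target action,
    instead of injecting extreme rewards as prior attacks assume."""
    lo, hi = reward_bounds
    return (
        apply_trigger(obs),
        TARGET_ACTION,                   # replace a high-return action
        float(np.clip(reward, lo, hi)),  # reward stays within natural constraints
        next_obs,
    )
```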
Lay Summary: Reinforcement Learning systems are often assumed to be trained in environments in which the developer can trust their data source. This isn't always the case, however, as external entities, referred to as "adversaries", can potentially modify training data with malicious intent. In this paper we study one such class of attack called a "backdoor attack". Here the goal of the adversary is to perturb the training data of a reinforcement learning model, called the "agent", such that it exhibits predetermined behavior upon observing some set "trigger". In the case of robotic agents with camera sensors, the trigger may take the form of a specific QR code while the predetermined behavior may be to accelerate forward no matter the consequences. Previous studies have demonstrated the possibility of these attacks against reinforcement learning agents under the strong assumption that the adversary can arbitrarily modify the agent's "reward" signal. The reward is what reinforcement learning agents use to determine which behavior is good and which is bad. This assumption is easy to break in practice, however, making prior attacks brittle. This may give practitioners a false sense of security that their systems are safe from backdoor attacks; this is not the case. Therefore, we propose a new class of attack that succeeds while only minimally altering the agent's reward. This motivates further research into stronger defenses against backdoor attacks.
Link To Code: https://github.com/EthanRath/Backdoors-In-RL
Primary Area: Social Aspects->Security
Keywords: backdoor attacks, adversarial machine learning, reinforcement learning, poisoning attacks
Submission Number: 6992