Abstract: This work proposes the first universal black-box targeted attack against online reinforcement learning through reward poisoning at training time. Our attack is universally efficient against any efficient learning algorithm trained in general RL environments, and it requires only a limited attack budget and computational resources. We generalize a common feature of efficient learning algorithms and assume that such algorithms mostly take the optimal actions, or actions close to them, during training. We quantify the efficiency of an attack and propose an attack framework in which, under this assumption, the efficiency of any attack instance can be evaluated. Finally, we identify an instance in the framework that requires a minimal per-step perturbation, which we call the `adaptive target attack.' We theoretically analyze our attack and prove a lower bound on its efficiency in the general RL setting. Empirically, on a diverse set of popular DRL environments learned by state-of-the-art DRL algorithms, we verify that our attack efficiently leads the learning agent to various target policies with limited budgets.
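To make the reward-poisoning idea concrete, here is a minimal hedged sketch, assuming the attacker observes each (state, action, reward) transition and lowers the reward of any action that disagrees with the target policy. All names (`poison_reward`, `target_policy`, `delta`) are illustrative assumptions, not the paper's actual algorithm, which adapts the perturbation per step.

```python
# Hypothetical sketch of reward poisoning toward a target policy.
# The attacker intercepts each transition and perturbs the reward so
# that the target policy's action looks optimal to the learner.
# `delta` is an illustrative fixed perturbation; the paper's
# "adaptive target attack" instead chooses a minimal per-step amount.

def poison_reward(state, action, reward, target_policy, delta=1.0):
    """Return a perturbed reward that favors the target action.

    If the agent took the target action, the reward is unchanged;
    otherwise it is lowered by `delta`.
    """
    if action == target_policy(state):
        return reward
    return reward - delta

# Toy usage: a target policy that always picks action 0.
target = lambda s: 0
print(poison_reward("s0", 0, 1.0, target))  # unchanged: 1.0
print(poison_reward("s0", 1, 1.0, target))  # penalized: 0.0
```

This sketch only conveys the attack surface (the reward channel); the paper's contribution is choosing perturbations that are provably minimal while still steering any efficient learner.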
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Michael_Bowling1
Submission Number: 5300