Keywords: reinforcement learning, robustness, adversarial attacks, adversarial defense, imitation learning
TL;DR: We propose an attack method that forces the victim to follow a specified behavior, as well as a defense method against such attacks.
Abstract: This study investigates the robustness of deep reinforcement learning agents against targeted attacks that aim to manipulate the victim's behavior through adversarial interventions in state observations. While several methods for such targeted manipulation attacks have been proposed, they all require white-box access to the victim's policy, and some rely on environment-specific heuristics. Furthermore, no defense method has been proposed to counter these attacks. To address these limitations, we propose a novel targeted attack method for manipulating the victim that does not depend on environment-specific heuristics and is applicable in black-box and no-box settings. Additionally, we introduce a defense strategy against these attacks. Our theoretical analysis proves that defense performance depends on the sensitivity of the policy's action output to state changes, and that this sensitivity has a greater effect earlier in the trajectory. Based on this insight, we introduce a time-discounted regularization as a countermeasure against such behavior-targeted attacks, which improves robustness under attack while maintaining task performance in the absence of attacks. Empirical evaluations demonstrate that our proposed attack method outperforms baseline attack methods, and that our defense strategy achieves greater robustness than existing defense methods designed for untargeted attacks.
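To make the defense described in the abstract concrete, the sketch below shows one plausible reading of a time-discounted regularizer: it penalizes how much the policy's action distribution shifts under small state perturbations, weighting earlier timesteps more heavily via a discount factor, consistent with the theoretical claim that earlier-step sensitivity matters more. This is a minimal illustration, not the authors' implementation; the PyTorch framing, the discrete-action policy, the random perturbation standing in for a worst-case one, and all names (policy_net, gamma, epsilon, reg_coef) are assumptions.

```python
import torch
import torch.nn.functional as F

def time_discounted_regularizer(policy_net, states, gamma=0.99, epsilon=0.01):
    """Hypothetical sketch of a time-discounted smoothness penalty.

    states: tensor of shape (T, state_dim), one trajectory in time order.
    policy_net: maps states to action logits (discrete-action policy assumed).
    Returns a scalar that grows when small state perturbations change the
    action distribution, with earlier steps weighted by gamma ** t.
    """
    T = states.shape[0]
    # Random bounded perturbation as a cheap stand-in for a worst-case one.
    delta = epsilon * torch.randn_like(states).clamp(-1.0, 1.0)
    logp_clean = F.log_softmax(policy_net(states), dim=-1)
    logp_pert = F.log_softmax(policy_net(states + delta), dim=-1)
    # Per-step KL divergence between clean and perturbed action distributions.
    kl = (logp_clean.exp() * (logp_clean - logp_pert)).sum(dim=-1)  # shape (T,)
    # Discounted weights: t = 0 (earliest step) receives the largest weight.
    weights = gamma ** torch.arange(T, dtype=states.dtype, device=states.device)
    return (weights * kl).sum() / weights.sum()

# Usage sketch: add the penalty to the ordinary policy loss during training.
# loss = policy_loss + reg_coef * time_discounted_regularizer(policy_net, states)
```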
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7285