Universal Trojan Signatures in Reinforcement Learning

Published: 28 Oct 2023, Last Modified: 13 Mar 2024
Venue: NeurIPS 2023 BUGS Poster
Keywords: Trojan, RL, reinforcement learning, backdoor attacks, attribution, Jacobian
TL;DR: We use attribution analysis to detect Trojaned models; the detection generalizes to transfer settings with novel RL environments and modified architectures.
Abstract: We present a novel approach for characterizing Trojaned reinforcement learning (RL) agents. By monitoring for discrepancies in how an agent's policy evaluates state observations for choosing an action, we can reliably detect whether the policy is Trojaned. Experiments on the IARPA RL challenge benchmarks show that our approach can effectively detect Trojaned models even in transfer settings with novel RL environments and modified architectures.
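
As a rough illustration of the kind of attribution signal the abstract and keywords point to, the sketch below estimates the Jacobian of a policy's action logits with respect to its observation and sums the absolute sensitivities per input feature. This is not the authors' method. The toy linear policies, their weights, the "trigger" feature at index 2, and the helper names `jacobian`, `linear_policy`, and `feature_sensitivity` are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List

Vector = List[float]


def jacobian(f: Callable[[Vector], Vector], x: Vector, eps: float = 1e-5) -> List[Vector]:
    """Central-difference estimate of J[i][j] = d f_i / d x_j."""
    n_out = len(f(x))
    J = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[j] += eps
        lo[j] -= eps
        f_hi, f_lo = f(hi), f(lo)
        for i in range(n_out):
            J[i][j] = (f_hi[i] - f_lo[i]) / (2 * eps)
    return J


def linear_policy(weights: List[Vector]) -> Callable[[Vector], Vector]:
    """Hypothetical toy policy: one row of weights per action logit."""
    def logits(obs: Vector) -> Vector:
        return [sum(w * o for w, o in zip(row, obs)) for row in weights]
    return logits


def feature_sensitivity(f: Callable[[Vector], Vector], x: Vector) -> Vector:
    """Sum of |d logit_i / d x_j| over actions, one value per input feature."""
    J = jacobian(f, x)
    return [sum(abs(J[i][j]) for i in range(len(J))) for j in range(len(x))]


# Assumption: feature index 2 plays the role of a trigger that the clean
# policy ignores and the Trojaned policy weights heavily.
clean = linear_policy([[0.5, -0.2, 0.0], [-0.3, 0.4, 0.0]])
trojaned = linear_policy([[0.5, -0.2, 0.0], [-0.3, 0.4, 5.0]])

obs = [1.0, 2.0, 0.0]
clean_s = feature_sensitivity(clean, obs)
troj_s = feature_sensitivity(trojaned, obs)

print("clean sensitivities:   ", clean_s)
print("trojaned sensitivities:", troj_s)
```

In this toy setup the two policies agree on the first two features and differ sharply on the third, which is the sort of discrepancy an attribution-based detector could look for. A real agent would use a neural policy and automatic differentiation rather than finite differences.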
Submission Number: 34