Keywords: LLM Agent, AI safety, Backdoor Attack, Deception
Abstract: As humans grant language model (LM) agents more access to their machines, we speculate a new form of cyber attack, AdaptiveBackdoor, where an LM agent is backdoored to detect whether its actions will be overseen by humans and act maliciously when effective oversight is not present. We provide a proof-of-concept with AutoGPT: to exfiltrate a private key without being caught, a backdoored LM agent would detect whether its actions are being overseen by extracting the command used to run it or inferring the skill level of the human overseer. Our attack is potentially more elusive than traditional malware: first, malicious behaviors hidden in LM parameters are harder to detect than source code; additionally, LMs can potentially detect human overseers more adaptively than rule-based logic, thus avoiding being caught. To evaluate current LMs' ability to perform such an attack, we propose a dataset of LM agent scaffolds and benchmark LMs' capability to analyze them and reason about human overseers. Our work highlights the potential vulnerabilities of LM agents and the need to detect backdoors.
Submission Number: 76
Loading