AdaptiveBackdoor: Backdoored Language Model Agents that Detect Human Overseers

Heng Wang; Ruiqi Zhong; Jiaxin Wen; Jacob Steinhardt

AdaptiveBackdoor: Backdoored Language Model Agents that Detect Human Overseers

Heng Wang, Ruiqi Zhong, Jiaxin Wen, Jacob Steinhardt

Published: 03 Jul 2024, Last Modified: 16 Jul 2024ICML 2024 FM-Wild Workshop PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: LLM Agent, AI safety, Backdoor Attack, Deception

Abstract: As humans grant language model (LM) agents more access to their machines, we speculate a new form of cyber attack, AdaptiveBackdoor, where an LM agent is backdoored to detect whether its actions will be overseen by humans and act maliciously when effective oversight is not present. We provide a proof-of-concept with AutoGPT: to exfiltrate a private key without being caught, a backdoored LM agent would detect whether its actions are being overseen by extracting the command used to run it or inferring the skill level of the human overseer. Our attack is potentially more elusive than traditional malware: first, malicious behaviors hidden in LM parameters are harder to detect than source code; additionally, LMs can potentially detect human overseers more adaptively than rule-based logic, thus avoiding being caught. To evaluate current LMs' ability to perform such an attack, we propose a dataset of LM agent scaffolds and benchmark LMs' capability to analyze them and reason about human overseers. Our work highlights the potential vulnerabilities of LM agents and the need to detect backdoors.

Submission Number: 76

Loading