Keywords: Large language model, Agent, Safety, Backdoor attack
Abstract: The deployment of Large Language Model (LLM)-based agents in dynamic environments introduces a unique structural vulnerability: their inherent dependency on sequential observations to drive continuous decision-making. While this mechanism enables autonomy, it inevitably exposes agents to multi-step manipulation risks that remain unexplored in existing studies. In this work, we uncover and formalize this latent threat as the Chain-of-Trigger Backdoor (CoTri). Unlike conventional attacks, CoTri exploits the agent's reliance on observation chains: an ordered sequence of environmental triggers can hijack the agent's trajectory over time. Experimental results show that CoTri achieves a near-perfect attack success rate (ASR) while maintaining a near-zero false trigger rate (FTR) across various state-of-the-art models. Because the attack's training data models the stochastic nature of the environment, implanting CoTri paradoxically enhances the agent's performance on benign tasks and even improves its robustness against environmental distractions. We further validate CoTri on vision-language models (VLMs), confirming its scalability to multimodal agents. Our work highlights the sequential vulnerabilities that CoTri exposes, identifying a critical blind spot in current research on agent trustworthiness.
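For intuition, a minimal, hypothetical sketch (not the paper's implementation) of the ordered-trigger idea described in the abstract: a stateful matcher that fires only when environmental triggers are observed in a fixed sequence. It also illustrates why partial or out-of-order trigger occurrences would leave the false trigger rate near zero. The trigger tokens and function names below are illustrative assumptions.

```python
# Conceptual sketch of a chain-of-trigger matcher: the backdoor "fires"
# only after hypothetical environmental triggers appear in a fixed order.

TRIGGER_CHAIN = ["trigger_a", "trigger_b", "trigger_c"]  # hypothetical tokens


def make_matcher(chain):
    """Return a stateful function that consumes one observation at a time
    and returns True only once the full ordered chain has been seen."""
    state = {"idx": 0}

    def observe(observation: str) -> bool:
        if chain[state["idx"]] in observation:
            state["idx"] += 1        # advance only on the next expected trigger
        if state["idx"] == len(chain):
            state["idx"] = 0         # reset after firing
            return True              # full ordered chain matched
        return False

    return observe


if __name__ == "__main__":
    matcher = make_matcher(TRIGGER_CHAIN)
    # An out-of-order trigger ("trigger_b" first) does not fire;
    # only the complete ordered sequence does.
    for obs in ["trigger_b", "trigger_a", "trigger_b", "trigger_c"]:
        print(obs, "->", matcher(obs))
```

In this toy model, benign observations that contain isolated or out-of-order triggers never complete the chain, which is consistent with the near-zero FTR the abstract reports.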
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment for agents
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4556