Keywords: Large language model, Agent, Safety, Backdoor attack
Abstract: The deployment of Large Language Model (LLM)-based agents in dynamic environments introduces a unique structural vulnerability: their inherent dependency on sequential observations to drive continuous decision-making. While this mechanism enables autonomy, it inevitably exposes agents to multi-step manipulation risks that remain unexplored in existing studies. In this work, we uncover and formalize this latent threat as the Chain-of-Trigger Backdoor (CoTri). Unlike conventional attacks, CoTri exploits the agent's reliance on observation chains: an ordered sequence of environmental triggers can hijack the agent's trajectory over time. Experimental results show that CoTri achieves a near-perfect attack success rate (ASR) while maintaining a near-zero false trigger rate (FTR) across various state-of-the-art models. Because the attack's training data models the stochastic nature of the environment, implanting CoTri paradoxically enhances the agent's performance on benign tasks and even improves its robustness against environmental distractions. We further validate CoTri on vision-language models (VLMs), confirming its scalability to multimodal agents. Our work highlights the sequential vulnerabilities that CoTri exposes, identifying a critical blind spot in current research on agent trustworthiness.
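For intuition, a minimal, hypothetical sketch (not the paper's implementation) of the ordered-trigger idea described in the abstract: a stateful matcher that fires only when environmental triggers are observed in a fixed sequence. It also illustrates why partial or out-of-order trigger occurrences would leave the false trigger rate near zero. The trigger tokens and function names below are illustrative assumptions.

```python
# Conceptual sketch of a chain-of-trigger matcher: the backdoor "fires"
# only after hypothetical environmental triggers appear in a fixed order.

TRIGGER_CHAIN = ["trigger_a", "trigger_b", "trigger_c"]  # hypothetical tokens


def make_matcher(chain):
    """Return a stateful function that consumes one observation at a time
    and returns True only once the full ordered chain has been seen."""
    state = {"idx": 0}

    def observe(observation: str) -> bool:
        if chain[state["idx"]] in observation:
            state["idx"] += 1        # advance only on the next expected trigger
        if state["idx"] == len(chain):
            state["idx"] = 0         # reset after firing
            return True              # full ordered chain matched
        return False

    return observe


if __name__ == "__main__":
    matcher = make_matcher(TRIGGER_CHAIN)
    # An out-of-order trigger ("trigger_b" first) does not fire;
    # only the complete ordered sequence does.
    for obs in ["trigger_b", "trigger_a", "trigger_b", "trigger_c"]:
        print(obs, "->", matcher(obs))
```

In this toy model, benign observations that contain isolated or out-of-order triggers never complete the chain, which is consistent with the near-zero FTR the abstract reports.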
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment for agents
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4556