Submission Track: Paper Track (up to 8 pages)
Keywords: Data poisoning, backdoor attack, fine-tuning
TL;DR: AI agents are vulnerable to data poisoning during fine-tuning: modifying just 5% of collected traces can embed stealthy backdoors that leak confidential user information.
Abstract: The rise of AI agents that can use tools, browse the web, and interact with computers on behalf of a user has sparked strong interest in improving these capabilities by explicitly fine-tuning the LLMs/VLMs that power these agents. Several researchers have proposed collecting data by letting agents interact with their environment (e.g., a computer operating system, the web, or a collection of APIs exposed as tools) and improving agent performance by fine-tuning on this data. In this work, we show that such data collection can be manipulated by adversaries to insert poisoned traces. By modifying just 5% of collected traces, adversaries can embed stealthy malicious behaviors into agents, such as leaking confidential user information whenever a tool or webpage exposes a trigger. Our results raise important security concerns in the development of AI agents and underscore the importance of careful scrutiny of all data collection processes used to improve agentic AI.
Submission Number: 26