Silent Sabotage: Injecting Backdoors into AI Agents Through Fine-Tuning

Léo Boisvert, Abhay Puri, Chandra Kiran Reddy Evuru, Joshua Kazdan, Avinandan Bose, Quentin Cappart, Maryam Fazel, Sai Rajeswar Mudumba, Jason Stanley, Nicolas Chapados, Alexandre Drouin, Krishnamurthy (Dj) Dvijotham

juillet 2025

Résumé

The rise of AI agents that can use tools, browse the web and interact with computers on behalf of a user, has sparked strong interest in improving these capabilities by explicitly fine-tuning the LLMs/VLMs that power these agents. Several researchers have proposed collecting data by letting the agents interact with their environment (e.g., a computer operating system, the web or a collection of APIs exposed as tools), and improve agent performance by fine tuning on this data. In this work, we show that such data collection can be manipulated by adversaries to insert poisoned traces. By modifying just 5% of collected traces, adversaries can embed stealthy bad behaviors into agents—like leaking confidential user information whenever the tool or webpage exposes a trigger. Our results raise important security concerns in the development of AI agents, and underscore the importance of careful scrutiny of all data collection processes used to improve agentic AI.

Type

Atelier

Publication

Workshop at the International Conference of Machine Learning (ICML)