Safety and Security

Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain

The practice of fine-tuning AI agents on data from their own interactions—such as web browsing or tool use—, while being a strong …

Léo Boisvert, Abhay Puri, Chandra Kiran Reddy Evuru, Nicolas Chapados, Quentin Cappart, Alexandre Lacoste, Nazanin Sepahvand, Krishnamurthy (Dj) Dvijotham, Alexandre Drouin

ACM Conference on AI and Agentic Systems, 2026.

No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms

Leading language model (LM) providers like OpenAI and Anthropic allow customers to fine-tune frontier LMs for specific use cases. To …

Joshua Kazdan, Abhay Puri, Rylan Schaeffer, Lisa Yu, Chris Cundy, Jason Stanley, Sanmi Koyejo, Krishnamurthy (Dj) Dvijotham

International Conference on Learning Representations, 2026.

DoomArena: A framework for Testing AI Agents Against Evolving Security Threats

We present DoomArena, a security evaluation framework for AI agents. DoomArena is designed on three principles: 1) It is a …

Léo Boisvert, Mihir Bansal, Chandra Kiran Reddy Evuru, Gabriel Huang, Abhay Puri, Avinandan Bose, Maryam Fazel, Quentin Cappart, Jason Stanley, Alexandre Lacoste, Alexandre Drouin, Krishnamurthy (Dj) Dvijotham

Conference on Language Modeling (COLM), 2025.

DoomArena: A framework for Testing AI Agents Against Evolving Security Threats

We present DoomArena, a security evaluation framework for AI agents. DoomArena is designed on three principles: 1) It is a …

Léo Boisvert, Abhay Puri, Gabriel Huang, Mihir Bansal, Chandra Kiran Reddy Evuru, Avinandan Bose, Quentin Cappart, Maryam Fazel, Alexandre Lacoste, Alexandre Drouin, Jason Stanley, Krishnamurthy (Dj) Dvijotham

Workshop at the International Conference of Machine Learning (ICML), 2025.

Silent Sabotage: Injecting Backdoors into AI Agents Through Fine-Tuning

The rise of AI agents that can use tools, browse the web and interact with computers on behalf of a user, has sparked strong interest …

Léo Boisvert, Abhay Puri, Chandra Kiran Reddy Evuru, Joshua Kazdan, Avinandan Bose, Quentin Cappart, Maryam Fazel, Sai Rajeswar Mudumba, Jason Stanley, Nicolas Chapados, Alexandre Drouin, Krishnamurthy (Dj) Dvijotham

Workshop at the International Conference of Machine Learning (ICML), 2025.

No, of course I can! Refusal Mechanisms Can Be Exploited Using Harmless Fine-Tuning Data

Leading language model (LM) providers like OpenAI and Google offer fine-tuning APIs that allow customers to adapt LMs for specific use …

Joshua Kazdan, Krishnamurthy (Dj) Dvijotham, Sanmi Koyejo

Workshop at the International Conference of Learning Representation (ICLR), 2025.