Safety and Security

No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms

Leading language model (LM) providers like OpenAI and Anthropic allow customers to fine-tune frontier LMs for specific use cases. To …

Joshua Kazdan, Abhay Puri, Rylan Schaeffer, Lisa Yu, Chris Cundy, Jason Stanley, Sanmi Koyejo, Krishnamurthy (Dj) Dvijotham

International Conference on Learning Representations, 2026.

Attack What Matters: Integrating Expert Insight and Automation in Threat-Model-Aligned Red Teaming

Prompt injection attacks target a key vulnerability in modern large language models: their inability to reliably distinguish between …

Kiarash Mohammadi, Abhay Puri, Georges Belanger Albarran, Mihir Bansal, Navdeep Gill, Yanick Chénard, Segan Subramanian, Marc-Etienne Brunet, Jason Stanley

NOW AI, 2025.

Shifting AI Security to the Left: Design-Time Defenses to Mitigate the Risks of Prompt Injections

Prompt injections pose a critical weakness for modern Large Language Models, making it difficult for AI to distinguish between …

Abhay Puri, Kevin Kasa, Kiarash Mohammadi, Georges Belanger Albarran, Mihir Bansal, Yanick Chénard, Marc-Etienne Brunet, Jason Stanley

NOW AI, 2025.

DoomArena: A framework for Testing AI Agents Against Evolving Security Threats

We present DoomArena, a security evaluation framework for AI agents. DoomArena is designed on three principles: 1) It is a …

Léo Boisvert, Mihir Bansal, Chandra Kiran Reddy Evuru, Gabriel Huang, Abhay Puri, Avinandan Bose, Maryam Fazel, Quentin Cappart, Jason Stanley, Alexandre Lacoste, Alexandre Drouin, Krishnamurthy (Dj) Dvijotham

Conference on Language Modeling (COLM), 2025.

DoomArena: A framework for Testing AI Agents Against Evolving Security Threats

We present DoomArena, a security evaluation framework for AI agents. DoomArena is designed on three principles: 1) It is a …

Léo Boisvert, Abhay Puri, Gabriel Huang, Mihir Bansal, Chandra Kiran Reddy Evuru, Avinandan Bose, Quentin Cappart, Maryam Fazel, Alexandre Lacoste, Alexandre Drouin, Jason Stanley, Krishnamurthy (Dj) Dvijotham

Workshop at the International Conference on Machine Learning (ICML), 2025.

Silent Sabotage: Injecting Backdoors into AI Agents Through Fine-Tuning

The rise of AI agents that can use tools, browse the web and interact with computers on behalf of a user, has sparked strong interest …

Léo Boisvert, Abhay Puri, Chandra Kiran Reddy Evuru, Joshua Kazdan, Avinandan Bose, Quentin Cappart, Maryam Fazel, Sai Rajeswar Mudumba, Jason Stanley, Nicolas Chapados, Alexandre Drouin, Krishnamurthy (Dj) Dvijotham

Workshop at the International Conference on Machine Learning (ICML), 2025.

No, of course I can! Refusal Mechanisms Can Be Exploited Using Harmless Fine-Tuning Data

Leading language model (LM) providers like OpenAI and Google offer fine-tuning APIs that allow customers to adapt LMs for specific use …

Joshua Kazdan, Krishnamurthy (Dj) Dvijotham, Sanmi Koyejo

Workshop at the International Conference on Learning Representations (ICLR), 2025.