Cybersecurity

No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms

Leading language model (LM) providers like OpenAI and Anthropic allow customers to fine-tune frontier LMs for specific use cases. To …

Joshua Kazdan, Abhay Puri, Rylan Schaeffer, Lisa Yu, Chris Cundy, Jason Stanley, Sanmi Koyejo, Krishnamurthy (Dj) Dvijotham

International Conference on Learning Representations, 2026.

Silent Sabotage: Injecting Backdoors into AI Agents Through Fine-Tuning

The rise of AI agents that can use tools, browse the web, and interact with computers on behalf of a user has sparked strong interest …

Léo Boisvert, Abhay Puri, Chandra Kiran Reddy Evuru, Joshua Kazdan, Avinandan Bose, Quentin Cappart, Maryam Fazel, Sai Rajeswar Mudumba, Jason Stanley, Nicolas Chapados, Alexandre Drouin, Krishnamurthy (Dj) Dvijotham

Workshop at the International Conference on Machine Learning (ICML), 2025.