ServiceNow AI Research

Agents

GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities
The rapid evolution of software libraries presents a significant challenge for code generation models, which must adapt to frequent …
Societal Alignment Frameworks Can Improve LLM Alignment
Recent progress in large language models (LLMs) has focused on producing responses that meet human expectations and align with shared …
No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms
Leading language model (LM) providers like OpenAI and Anthropic allow customers to fine-tune frontier LMs for specific use cases. To …
How to Train Your LLM Web Agent: A Statistical Diagnosis
Large language model (LLM) agents for web interfaces have advanced rapidly, yet open-source systems still lag behind proprietary …
FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering
We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended enterprise deep research tasks. Unlike existing …
WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
Understanding diverse web data and automating web development presents an exciting challenge for agentic multimodal models. While …
DoomArena: A framework for Testing AI Agents Against Evolving Security Threats
We present DoomArena, a security evaluation framework for AI agents. DoomArena is designed on three principles: 1) It is a …