Agents

Privileged Information Distillation for Language Models

Training-time privileged information (PI) can enable language models to succeed on tasks they would otherwise fail, making it a …

Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, Massimo Caccia

International Conference on Machine Learning (ICML), 2026.

GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities

The rapid evolution of software libraries presents a significant challenge for code generation models, which must adapt to frequent …

Nizar Islah, Justine Gehring, Diganta Misra, Eilif Muller, Irina Rish, Eilif Benjamin Muller, Massimo Caccia

Annual Meeting of the Association for Computational Linguistics (ACL), 2026.

Societal Alignment Frameworks Can Improve LLM Alignment

Recent progress in large language models (LLMs) has focused on producing responses that meet human expectations and align with shared …

Karolina Stanczak, Nicholas Meade, Mehar Bhatia, Hattie Zhou, Konstantin Böttinger, Jeremy Barns, Jason Stanley, Nicolas Papernot, Nicolas Chapados, Denis Therien, Timothy P Lillicrap, Ana Marasovic, Sylvie Delacroix, Gillian K Hadfield, Siva Reddy

ACM Conference on Fairness, Accountability, and Transparency, 2026.

CUA-Suite: Expert Trajectories and Pixel-Precise Grounding for Computer-use Agents

Xiangru Jian, Shravan Nayak, Kevin Qinghong Lin, Aarash Feizi, Kaixin Li, Patrice Béchard, Spandana Gella, Sai Rajeswar Mudumba

Workshop at the International Conference on Machine Learning (ICML), 2026.

No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms

Leading language model (LM) providers like OpenAI and Anthropic allow customers to fine-tune frontier LMs for specific use cases. To …

Joshua Kazdan, Abhay Puri, Rylan Schaeffer, Lisa Yu, Chris Cundy, Jason Stanley, Sanmi Koyejo, Krishnamurthy (Dj) Dvijotham

International Conference on Learning Representations (ICLR), 2026.

GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities

The rapid evolution of software libraries presents a significant challenge for code generation models, which must adapt to frequent …

Nizar Islah, Justine Gehring, Diganta Misra, Eilif Muller, Irina Rish, Eilif Benjamin Muller, Massimo Caccia

Workshop at the Conference on Neural Information Processing Systems (NeurIPS), 2025.

How to Train Your LLM Web Agent: A Statistical Diagnosis

Large language model (LLM) agents for web interfaces have advanced rapidly, yet open-source systems still lag behind proprietary …

Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Thibault Le Sellier De Chezelles, Megh Thakkar, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Steve Liu, Alexandre Drouin, Alexandre Piche, Alexandre Lacoste, Massimo Caccia

Workshop at the Conference on Neural Information Processing Systems (NeurIPS), 2025.

How to Train Your LLM Web Agent: A Statistical Diagnosis

Large language model (LLM) agents for web interfaces have advanced rapidly, yet open-source systems still lag behind proprietary …

Neural Information Processing Systems (NeurIPS), 2025.

FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering

We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended enterprise deep research tasks. Unlike existing …

Amirhossein Abaskohi, Spandana Gella, Giuseppe Carenini, Issam H. Laradji

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025.

WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

Understanding diverse web data and automating web development presents an exciting challenge for agentic multimodal models. While …

Rabiul Awal, Mahsa Massoud, Zichao Li, Aarash Feizi, Suyuchen Wang, Christopher Pal, Aishwarya Agrawal, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar Mudumba

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025.