Web Agents

WebArena-Pro: A Heterogeneous, Multimodal, Reproducible Benchmark for Web Agents

Web agents powered by large language and vision-language models are increasingly applied to realistic browser work that spans …

Imene Kerboua, Fatemeh Pesaran, Xing Han Lu, Weijian Qi, Alexander Miller, Junyi Song, Yunjia Tian, Dongjin Kang, Seyeon Choi, Marzia Nouri, Ewen Gueguen, Matteo Boglioni, Fengyuan Liu, Zeyi Liao, Mengqi Yuan, Yue Li, Alexandre Lacoste, Alexandre Drouin, Spandana Gella, Huan Sun, Gunhee Kim, Siva Reddy

Workshop at the International Conference on Machine Learning (ICML), 2026.

How to Train Your LLM Web Agent: A Statistical Diagnosis

Large language model (LLM) agents for web interfaces have advanced rapidly, yet open-source systems still lag behind proprietary …

Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Thibault Le Sellier De Chezelles, Megh Thakkar, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Steve Liu, Alexandre Drouin, Alexandre Piche, Alexandre Lacoste, Massimo Caccia

Workshop at the Conference on Neural Information Processing Systems (NeurIPS), 2025.

How to Train Your LLM Web Agent: A Statistical Diagnosis

Large language model (LLM) agents for web interfaces have advanced rapidly, yet open-source systems still lag behind proprietary …

Conference on Neural Information Processing Systems (NeurIPS), 2025.

WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

Understanding diverse web data and automating web development presents an exciting challenge for agentic multimodal models. While …

Rabiul Awal, Mahsa Massoud, Zichao Li, Aarash Feizi, Suyuchen Wang, Christopher Pal, Aishwarya Agrawal, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar Mudumba

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025.

How to Train Your LLM Web Agent: A Statistical Diagnosis (Oral)

Large language model (LLM) agents for web interfaces have advanced rapidly, yet open-source systems still lag behind proprietary …

Workshop at the International Conference on Machine Learning (ICML), 2025.

SafeArena: Evaluating the Safety of Autonomous Web Agents

LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse …

Ada Tur, Nicholas Meade, Xing Han Lu, Alejandra Zambrano, Arkil Patel, Esin Durmus, Spandana Gella, Karolina Stanczak, Siva Reddy

International Conference on Machine Learning (ICML), 2025.

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Developing autonomous agents that can navigate diverse Graphical User Interfaces (GUIs) and solve complex tasks is essential for …

Shravan Nayak, Xiangru Jian, Kevin Lin, Juan A. Rodriguez, Motek Kalsi, Nicolas Chapados, Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, Sai Rajeswar Mudumba

International Conference on Machine Learning (ICML), 2025.

WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

Understanding diverse web data and automating web development presents an exciting challenge for agentic AI. While existing benchmarks …

Rabiul Awal, Mahsa Massoud, Zichao Li, Aarash Feizi, Suyuchen Wang, Christopher Pal, Aishwarya Agrawal, David Vazquez, Siva Reddy, Juan A. Rodriguez, Perouz Taslakian, Spandana Gella, Sai Rajeswar Mudumba

Workshop at the International Conference on Learning Representations (ICLR), 2025.

The BrowserGym Ecosystem for Web Agent Research

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those …

Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Siva Reddy, Quentin Cappart, Graham Neubig, Nicolas Chapados, Alexandre Lacoste

Transactions on Machine Learning Research (TMLR), 2025.

AgentMerge: Enhancing Generalization in Fine-Tuned LLM Agents

Recent advancements in large language models (LLMs) have spurred interest in developing autonomous agents capable of performing complex …

Megh Thakkar, Léo Boisvert, Thibault Le Sellier De Chezelles, Alexandre Piche, Maxime Gasse, Alexandre Lacoste, Massimo Caccia

Workshop at the Conference on Neural Information Processing Systems (NeurIPS), 2024.