Benchmark

WebArena-Pro: A Heterogeneous, Multimodal, Reproducible Benchmark for Web Agents

Web agents powered by large language and vision-language models are increasingly applied to realistic browser work that spans …

Imene Kerboua, Fatemeh Pesaran, Xing Han Lu, Weijian Qi, Alexander Miller, Junyi Song, Yunjia Tian, Dongjin Kang, Seyeon Choi, Marzia Nouri, Ewen Gueguen, Matteo Boglioni, Fengyuan Liu, Zeyi Liao, Mengqi Yuan, Yue Li, Alexandre Lacoste, Alexandre Drouin, Spandana Gella, Huan Sun, Gunhee Kim, Siva Reddy

Workshop at the International Conference of Machine Learning (ICML), 2026.

DRBench: A Realistic Benchmark for Enterprise Deep Research

We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. Unlike …

Amirhossein Abaskohi, Tianyi Chen, Miguel Muñoz-Mármol, Curtis Fox, Amrutha Ramesh, Étienne Marcotte, Xing Han Lu, Nicolas Chapados, Spandana Gella, Christopher Pal, Alexandre Drouin, Issam H. Laradji

International Conference on Learning Representations, 2026.

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

While numerous recent benchmarks focus on evaluating generic Vision-Language Models (VLMs), they fall short in addressing the unique …

Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Shahbaz Khan, Paolo Fraccaro, Alexandre Lacoste, Salman Khan

International Conference on Computer Vision (ICCV), 2025.

WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

Understanding diverse web data and automating web development presents an exciting challenge for agentic AI. While existing benchmarks …

Rabiul Awal, Mahsa Massoud, Zichao Li, Aarash Feizi, Suyuchen Wang, Christopher Pal, Aishwarya Agrawal, David Vazquez, Siva Reddy, Juan A. Rodriguez, Perouz Taslakian, Sai Rajeswar Mudumba

Workshop at the Computer Vision and Pattern Recognition Conference (CVPR), 2025.

The Landscape of Causal Discovery Data: Grounding Causal Discovery in Real-World Applications

Causal discovery aims to automatically uncover causal relationships from data, a capability with significant potential across many …

Philippe Brouillard, Chandler Squires, Jonas Wahl, Konrad P. Kording, Karen Sachs, Dhanya Sridhar, Alexandre Drouin

Causal Learning and Reasoning (CLeaR), 2025.

WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

Understanding diverse web data and automating web development presents an exciting challenge for agentic AI. While existing benchmarks …

Rabiul Awal, Mahsa Massoud, Zichao Li, Aarash Feizi, Suyuchen Wang, Christopher Pal, Aishwarya Agrawal, David Vazquez, Siva Reddy, Juan A. Rodriguez, Perouz Taslakian, Spandana Gella, Sai Rajeswar Mudumba

Workshop at the International Conference of Learning Representation (ICLR), 2025.

MMTEB: Massive Multilingual Text Embedding Benchmark

Text embeddings are typically evaluated on a narrow set of tasks, limited in terms of languages, domains, and task types. To circumvent …

Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzeminski, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çagatan, Akash Kundu, Martin Bernstorff, Shitao Xiao, Akshita Sukhlecha, Bhavish Pahwa, Rafał Poswiata, Kranthi Kiran GV, Shawon Ashraf, Daniel Auras, Björn Plüster, Jan Philipp Harries, Loïc Magne, Isabelle Mohr, Mariya Hendriksen, Dawei Zhu, Hippolyte Gisserot-Boukhlef, Tom Aarsen, Jan Kostkan, Konrad Wojtasik, Taemin Lee, Marek Šuppa, Crystina Zhang, Roberta Rocca, Mohammed Hamdy, Andrianos Michail, John Yang, Manuel Faysse, Aleksei Vatolin, Nandan Thakur, Manan Dey, Dipam Vasani, Pranjal Chitale, Simone Tedeschi, Nguyen Tai, Artem Snegirev, Michael Günther, Mengzhou Xia, Weijia Shi, Xing Han Lu, Jordan Clive, Gayatri Krishnakumar, Anna Maksimova, Silvan Wehrli, Maria Tikhonova, Henil Pancha, Aleksandr Abramov, Malte Ostendorff, Zheng Liu, Simon Clematide, Lester James Miranda, Alena Fenogenova, Guangyu Song, Ruqiya Bin Saf, Wen-Ding Li, Alessia Borghini, Federico Cassano, Hongjin Su, Jimmy Lin, Howard Yen, Lasse Hansen, Sara Hooker, Chenghao Xiao, Vaibhav Adlakha, Orion Weller, Siva Reddy, Niklas Muennighoff

International Conference of Learning Representations (ICLR), 2025.

EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision

This paper presents EarthView, a comprehensive dataset specifically designed for self-supervision on remote sensing data, intended to …

Diego Velazquez, Pau Rodriguez, Sergio Alonso, Josep M. Gonfaus, Jordi Gonzalez, Gerardo Richarte, Javier Marin, Yoshua Bengio, Alexandre Lacoste

Workshop at the Winter Conference on Applications of Computer Vision (WACV), 2025.

Context is Key: A Benchmark for Forecasting with Essential Textual Information

Forecasting is a critical task in decision making across various domains. While numerical data provides a foundation, it often lacks …

Andrew Williams, Arjun Ashok, Étienne Marcotte, Valentina Zantedeschi, Jithendaraa Subramanian, Roland Riachi, James Requeima, Alexandre Lacoste, Irina Rish, Nicolas Chapados, Alexandre Drouin

Workshop at the Neural Information Processing Systems (NeurIPS), 2024.

Context is Key: A Benchmark for Forecasting with Essential Textual Information

Forecasting is a critical task in decision making across various domains. While numerical data provides a foundation, it often lacks …

Andrew Williams, Arjun Ashok, Étienne Marcotte, Valentina Zantedeschi, Jithendaraa Subramanian, Roland Riachi, James Requeima, Alexandre Lacoste, Irina Rish, Nicolas Chapados, Alexandre Drouin

Montreal AI Symposium (MAIS), 2024.