AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding

Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models …

Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque Prince, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar Mudumba

Neural Information Processing Systems (NeurIPS), 2025.

Causal Differentiating Concepts: Interpreting LM Behavior via Causal Representation Learning

Language model activations entangle concepts that mediate their behavior, making it difficult to interpret these factors, which has …

Navita Goyal, Hal Daumé III, Alexandre Drouin, Dhanya Sridhar

Neural Information Processing Systems (NeurIPS), 2025.

How to Train Your LLM Web Agent: A Statistical Diagnosis

Large language model (LLM) agents for web interfaces have advanced rapidly, yet open-source systems still lag behind proprietary …

Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Thibault Le Sellier De Chezelles, Megh Thakkar, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Steve Liu, Alexandre Drouin, Alexandre Piche, Alexandre Lacoste, Massimo Caccia

Neural Information Processing Systems (NeurIPS), 2025.

Rendering-Aware Reinforcement Learning for Vector Graphics Generation

Scalable Vector Graphics (SVG) offer a powerful format for representing visual designs as interpretable code. Recent advances in …

Juan A. Rodriguez, Haotian Zhang, Abhay Puri, Rishav Pramanik, Aarash Feizi, Pascal Wichmann, Arnab Mondal, Mohammad Reza Samsami, Rabiul Awal, Perouz Taslakian, Spandana Gella, Sai Rajeswar Mudumba, David Vazquez, Christopher Pal, Marco Pedersoli

Neural Information Processing Systems (NeurIPS), 2025.

The Promise of RL for Autoregressive Image Editing

While image generation techniques are now capable of producing high quality images that respect prompts which span multiple sentences, …

Saba Ahmadi, Rabiul Awal, Ankur Sikarwar, Amirhossein Kazemnejad, Ge Ya Luo, Juan A. Rodriguez, Sai Rajeswar Mudumba, Siva Reddy, Christopher Pal, Benno Krojer, Aishwarya Agrawal

Neural Information Processing Systems (NeurIPS), 2025.

ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval

Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, …

Ahmed Masry, Megh Thakkar, Patrice Béchard, Sathwik Madhusudhan, Rabiul Awal, Shambhavi Mishra, Akshay Kalkunte, Enamul Hoque Prince, Spandana Gella, Torsten Scholak, Sai Rajeswar Mudumba

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025.

FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering

We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended enterprise deep research tasks. Unlike existing …

Amirhossein Abaskohi, Spandana Gella, Giuseppe Carenini, Issam H. Laradji

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025.

WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

Understanding diverse web data and automating web development presents an exciting challenge for agentic multimodal models. While …

Rabiul Awal, Mahsa Massoud, Zichao Li, Aarash Feizi, Suyuchen Wang, Christopher Pal, Aishwarya Agrawal, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar Mudumba

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025.

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

While numerous recent benchmarks focus on evaluating generic Vision-Language Models (VLMs), they fall short in addressing the unique …

Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Shahbaz Khan, Paolo Fraccaro, Alexandre Lacoste, Salman Khan

International Conference on Computer Vision (ICCV), 2025.

BigCharts-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning

Charts are essential to data analysis, transforming raw data into clear visual representations that support human decision-making. …

Ahmed Masry, Abhay Puri, Masoud Hashemi, Juan A. Rodriguez, Megh Thakkar, Khyati Mahajan, Vikas Yadav, Sathwik Tejaswi Madhusudhan, Alexandre Piche, Dzmitry Bahdanau, Christopher Pal, David Vazquez, Enamul Hoque Prince, Perouz Taslakian, Sai Rajeswar Mudumba, Spandana Gella

Conference on Language Modeling (COLM), 2025.