ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval

Ahmed Masry, Megh Thakkar, Patrice Béchard, Sathwik Madhusudhan, Rabiul Awal, Shambhavi Mishra, Akshay Kalkunte, Enamul Hoque Prince, Spandana Gella, Torsten Scholak, Sai Rajeswar Mudumba

November 2025

Abstract

Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or structure the retrieval components. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate utilizes a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction mechanism more relevant to multimodal document structures and visual characteristics. ColMate achieves a 3.61% improvement over existing retrieval models, setting a new state of the art on the ViDoRe V1 and V2 benchmarks.

Type

Conference paper

Publication

Conference on Empirical Methods in Natural Language Processing (EMNLP)