ServiceNow AI Research

ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval

Abstract

Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or structure the retrieval components. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate uses a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction mechanism better suited to multimodal document structures and visual characteristics. ColMate achieves a 3.61% improvement over existing retrieval models, setting a new state of the art on the ViDoRe V1 and V2 benchmarks.
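For readers unfamiliar with late interaction, the following is a minimal sketch of the standard ColBERT-style MaxSim scoring that late interaction retrievers commonly build on. It is not ColMate's own mechanism, which the abstract does not specify. The function names (`l2_normalize`, `maxsim_score`, `rank_documents`) and the toy two-dimensional embeddings are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Standard MaxSim: for each query token embedding, take its best match
    among the document's token (or image patch) embeddings, then sum."""
    q = [l2_normalize(v) for v in query_vecs]
    d = [l2_normalize(v) for v in doc_vecs]
    total = 0.0
    for qv in q:
        total += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return total


def rank_documents(query_vecs, docs):
    """Return document ids sorted by descending MaxSim score.
    `docs` maps a document id to its list of embeddings."""
    scores = {doc_id: maxsim_score(query_vecs, vecs) for doc_id, vecs in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy example with made-up embeddings (assumption: 2-dimensional vectors).
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
    "doc_b": [[1.0, 0.0]],              # matches only the first query token
}
ranking = rank_documents(query, docs)
```

In this toy setup `doc_a` ranks above `doc_b` because every query embedding finds a close match in it. A model like ColMate would replace the made-up vectors with learned embeddings and, per the abstract, a scoring rule adapted to multimodal documents.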

Publication
Conference on Empirical Methods in Natural Language Processing (EMNLP)
Shambhavi Mishra
Visiting Researcher

Visiting Researcher at AI Research Deployment, located in Montreal, QC, Canada.

Spandana Gella
Research Manager

Research Manager at Frontier AI Research, located in Montreal, QC, Canada.

Torsten Scholak
Research Lead

Research Lead at AI Research Deployment, located in Montreal, QC, Canada.