ColFlor: Towards BERT-Size Vision-Language Document Retrieval Models

Ahmed Masry, Enamul Hoque

Published: 2025, Last Modified: 06 Jan 2026, Venue: MLSP 2025, License: CC BY-SA 4.0

Abstract: Traditional document retrieval systems for PDFs, charts, and infographics rely heavily on Optical Character Recognition (OCR) pipelines to extract textual content, a process that is both error-prone and resource-intensive. Recent multimodal models such as ColPali enable OCR-free retrieval by processing documents directly as images, but their large size (three billion parameters) makes them computationally expensive and impractical for large-scale applications. To address this limitation, we introduce ColFlor, an efficient OCR-free visual document retrieval model with only 174 million parameters. ColFlor achieves performance comparable to ColPali on text-rich English documents, with only a 1.8% decrease in NDCG@5, while being significantly faster at image encoding (5.25 times faster) and query encoding (9.8 times faster). This makes OCR-free document retrieval more cost-effective for large-scale applications and more accessible to users with limited computational resources.
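For readers unfamiliar with the NDCG@5 metric cited in the abstract, the following is a minimal sketch of how it is commonly computed for a single query. It is not the authors' evaluation code: the function names `dcg_at_k` and `ndcg_at_k` are hypothetical, the linear-gain variant of DCG is assumed, and the input is assumed to be the relevance labels of all judged documents in the order the retriever ranked them.

```python
import math


def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the top-k results (linear gain, assumed)."""
    # Rank positions are 1-based in the formula, so position i (0-based)
    # is discounted by log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=5):
    """NDCG@k for one query.

    `relevances` lists the relevance label of every judged document,
    in the order the retriever ranked them (an assumption of this sketch).
    """
    # The ideal ranking places the most relevant documents first.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# One relevant page, retrieved at rank 2 instead of rank 1.
score = ndcg_at_k([0, 1, 0, 0, 0])
print(round(score, 4))
```

A benchmark score is typically the mean of this per-query value over all queries, so a small percentage drop in NDCG@5 corresponds to relevant pages landing slightly lower within the top five results on average.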

External IDs: dblp:conf/mlsp/MasryH25