Keywords: Information Retrieval, Document Understanding, Vision Large Language Models, Multimodal Document Retrieval.
TL;DR: We propose ColFlor, an efficient OCR-free multimodal document retrieval model with 174 million parameters.
Abstract: Traditional document retrieval systems for PDFs, charts, and infographics rely heavily on Optical Character Recognition (OCR) pipelines to extract textual content, a process that is both error-prone and resource-intensive. Recent multimodal models such as ColPali enable OCR-free retrieval by processing documents directly as images, but their large size (three billion parameters) makes them computationally expensive and impractical for large-scale applications. To address this limitation, we introduce ColFlor, an efficient OCR-free visual document retrieval model with only 174 million parameters. On text-rich English documents, ColFlor performs comparably to ColPali, with only a 1.8% decrease in NDCG@5, while encoding images 5.25 times faster and queries 9.8 times faster. This makes OCR-free document retrieval systems more cost-effective for large-scale applications and more accessible to users with limited computational resources.
Submission Number: 25
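Note on the metric: the abstract reports retrieval quality as NDCG@5. The sketch below shows one common way to compute it for a single query, assuming linear gain, a log2 rank discount, and that `relevances` holds the relevance labels of all candidate pages in ranked order. The function names `dcg_at_k` and `ndcg_at_k` are illustrative, and this is not necessarily the evaluation code used for the reported numbers.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k ranked items (linear gain)."""
    # Rank positions are 1-based, so position i (0-based) is discounted by log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 5) -> float:
    """DCG@k normalized by the DCG@k of the ideal (descending) ordering.

    Assumes `relevances` covers every candidate for the query, so the ideal
    ordering can be obtained by sorting the same labels.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # One relevant page ranked first versus ranked second.
    print(ndcg_at_k([1, 0, 0, 0, 0]))
    print(ndcg_at_k([0, 1, 0, 0, 0]))
```

With a single relevant page per query, the score depends only on that page's rank: it is 1 at rank one, falls off with the log2 discount, and is 0 when the page lands outside the top five.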