ColFlor: Towards BERT-Size Vision-Language Document Retrieval Models

NeurIPS 2024 Workshop MusIML, Submission 25 Authors

Published: 30 Nov 2024, Last Modified: 01 Dec 2024. Venue: MusIML Poster. License: CC BY 4.0

Keywords: Information Retrieval, Document Understanding, Vision Large Language Models, Multimodal Document Retrieval.

TL;DR: We propose an efficient multimodal document retrieval model.

Abstract:

Traditional document retrieval systems for PDFs, charts, and infographics rely heavily on Optical Character Recognition (OCR) pipelines to extract textual content, a process that is both error-prone and resource-intensive. Recent multimodal models such as ColPali enable OCR-free retrieval by processing documents directly as images, but their large size (three billion parameters) makes them computationally expensive and impractical for large-scale applications. To address this limitation, we introduce ColFlor, an efficient OCR-free visual document retrieval model with only 174 million parameters. ColFlor achieves performance comparable to ColPali on text-rich English documents, with only a 1.8% decrease in NDCG@5, while encoding images 5.25 times faster and queries 9.8 times faster. This makes OCR-free document retrieval systems more cost-effective for large-scale applications and more accessible to users with limited computational resources.
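
For readers unfamiliar with the metric, NDCG@5 scores a ranked list by how close to the top the relevant documents appear, normalized by the best possible ordering. The following is a minimal sketch of the common linear-gain formulation, not the evaluation code used for ColFlor. The function names and the toy relevance lists are hypothetical, and the sketch assumes `relevances` holds the relevance label of every candidate document in ranked order.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items (linear gain)."""
    # rank is 0-based, so the discount for the top item is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=5):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ranking.

    Assumption: `relevances` covers all candidates for the query, so sorting
    it in descending order yields the ideal ranking.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Toy example (hypothetical labels): one relevant page among six candidates.
top_hit = ndcg_at_k([1, 0, 0, 0, 0, 0])      # relevant page ranked first
second_hit = ndcg_at_k([0, 1, 0, 0, 0, 0])   # relevant page ranked second
missed = ndcg_at_k([0, 0, 0, 0, 0, 1])       # relevant page outside the top 5
```

In this toy setting a relevant page ranked first scores higher than one ranked second, and a relevant page that falls outside the top five contributes nothing.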

Submission Number: 25
