Keywords: Evaluation, Document Retrieval, Multimodal Retrieval
Abstract: Scientific papers present information through multiple modalities, including text, tables, and figures. Retrieving relevant documents for a user query is challenging because evidence may be distributed across these heterogeneous sources. Existing benchmarks focus either on text-only or on image-based retrieval, and even multimodal benchmarks often rely on PDF pages converted to images, which capture layout but discard the structured and textual content of the document. We introduce two new datasets for document-level scientific retrieval, \textbf{ArXivDocQA} and \textbf{SciFactDoc}, built from the full \LaTeX{} sources of arXiv papers. We evaluate state-of-the-art text retrievers and multimodal retrievers under different representation strategies, including \textit{text-only}, \textit{figures-only}, and \textit{PDF-as-pages}. Our results show that no dataset achieves its best performance with PDF pages, suggesting that page-image encodings, despite their use in recent multimodal retrievers, are suboptimal for scientific retrieval. Surprisingly, ColPali remains competitive even when applied to text-only representations, highlighting its cross-modal robustness. Our findings demonstrate that representing scientific documents as coherent multimodal objects remains an open challenge, and our datasets provide a new testbed for advancing document-level multimodal retrieval.
Submission Number: 340