Keywords: multimodal embeddings, retrieval, evaluation
Abstract: We introduce ArXivDocQA, an open-domain retrieval benchmark constructed directly from the raw LaTeX sources of scientific papers. Operating on LaTeX preserves fine-grained structural information (figures, tables, equations, and section boundaries) and enables the controlled construction of decontextualized queries grounded in specific parts of a document. The benchmark therefore supports evaluation under realistic scientific search settings with explicit control over the evidence source. We systematically compare text-only, image-based, and multimodal retrieval representations under varying storage budgets, and show that document-as-image representations, on which many state-of-the-art document retrieval models are trained, are not universally optimal: their performance depends strongly on whether the relevant evidence resides in the document's text, tables, or figures.
Paper Type: Long
Research Area: Information Extraction and Retrieval
Research Area Keywords: document representation, evaluation, benchmarking
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 8145