Representation Matters: A Budget-Controlled Evaluation of Scientific Document Embeddings

ACL ARR 2026 January Submission 8145 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: multimodal embeddings, retrieval, evaluation
Abstract: We introduce ArXivDocQA, an open-domain retrieval benchmark constructed directly from the raw LaTeX sources of scientific papers. By operating on LaTeX, ArXivDocQA retains fine-grained structural information—including figures, tables, equations, and section boundaries—while enabling controlled construction of decontextualized queries grounded in specific parts of a document. This design supports evaluation under realistic scientific search settings with explicit control over the evidence source. We systematically compare text-only, image-based, and multimodal retrieval representations under varying storage budgets, and show that document-as-image representations—on which many state-of-the-art document retrieval models are trained—are not universally optimal: their performance depends strongly on where the relevant evidence resides in the document (text, tables, or figures).
Paper Type: Long
Research Area: Information Extraction and Retrieval
Research Area Keywords: document representation, evaluation, benchmarking
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 8145