Abstract: Document retrieval-augmented generation (RAG) systems have shown great promise for enhancing language models with external knowledge, but their evaluation has so far been limited to synthetic or unimodal benchmarks that do not reflect real-world use cases. To address this gap, we introduce \method, the first large-scale, multilingual, multimodal benchmark designed specifically for document RAG. Our benchmark assembles over 62,000 pages of multilingual, multi-type documents, synthesizes 2,000 single-hop and 2,000 multi-hop queries with exhaustive evidence labels via fine-grained principles and a knowledge-graph-driven pipeline, and refines all ground-truth annotations through expert human review to ensure high precision. We evaluate seven state-of-the-art embedding models and three end-to-end RAG frameworks, showing that multimodal embeddings deliver retrieval gains of up to 15.48\% over textual embeddings, and that current frameworks still lack effective pipelines for multi-page understanding. By diagnosing key shortcomings of current approaches and offering a comprehensive evaluation framework, \method provides a rigorous foundation for future research in multimodal document retrieval-augmented generation. The source code and dataset are publicly available at \url{https://anonymous.4open.science/r/DocRAG_Bench-7D34}.
Paper Type: Long
Research Area: Information Retrieval and Text Mining
Research Area Keywords: Document Retrieval, Retrieval-Augmented Generation, RAG, Multimodal Documents, Evaluation Benchmark, Document Understanding
Contribution Types: Data resources
Languages Studied: English, French, Spanish, Chinese, Japanese, Arabic
Submission Number: 1361