Abstract: Document retrieval-augmented generation (RAG) systems have shown great promise for enhancing language models with external knowledge, but their evaluation has so far been limited to synthetic or unimodal benchmarks that do not reflect real-world use cases. To address this gap, we introduce \method, the first large-scale, multilingual, multimodal benchmark designed specifically for document RAG. Our benchmark assembles over 62,000 pages of multilingual, multi-type documents, synthesizes 2,000 single-hop and 2,000 multi-hop queries with exhaustive evidence labels via fine-grained principles and a knowledge-graph-driven pipeline, and refines all ground-truth annotations through expert human review to ensure high precision. We evaluate seven state-of-the-art embedding models and three end-to-end RAG frameworks, showing that multimodal embeddings deliver retrieval gains of up to 15.48\% over textual embeddings, and that current frameworks still lack effective pipelines for multi-page understanding. By diagnosing key shortcomings of current approaches and offering a comprehensive evaluation framework, \method provides a rigorous foundation for future research in multimodal document retrieval-augmented generation. The source code and dataset are publicly available at \url{https://anonymous.4open.science/r/DocRAG_Bench-7D34}.
Paper Type: Long
Research Area: Information Retrieval and Text Mining
Research Area Keywords: Document Retrieval, Retrieval-Augmented Generation, RAG, Multimodal Documents, Evaluation Benchmark, Document Understanding
Contribution Types: Data resources
Languages Studied: English, French, Spanish, Chinese, Japanese, Arabic
Submission Number: 1361