RAViG-Bench: A Benchmark for Retrieval-Augmented Visually-rich Generation with Multi-modal Automated Evaluation

09 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: benchmark, retrieval-augmented generation, large language models, visually-rich generation, automated evaluation
Abstract: Retrieval-Augmented Visually-rich Generation (RAViG) extends RAG by integrating textual explanations with multiple visual elements in a well-structured layout. Despite its growing adoption, no existing benchmark offers a holistic evaluation of RAViG. Current RAG benchmarks focus on text-only generation, while natural-language-to-visualization (NL2VIS) benchmarks focus on "show-data-as-chart" style queries and do not follow the RAG paradigm. To address this gap, we present RAViG-Bench, the first comprehensive benchmark specifically designed for RAViG. The benchmark features a diverse collection of authentic user queries, each paired with real-world web retrievals to simulate realistic RAViG scenarios. In addition, we introduce a novel multi-modal automated evaluation framework that holistically assesses the quality of RAViG outputs. The framework evaluates the functionality, design quality, and content quality of both the textual and visual components of the generated output. Our extensive experiments on leading commercial and open-source LLMs provide a comprehensive analysis of their current capabilities, highlighting significant limitations and charting key directions for future research in this emerging area.
Primary Area: datasets and benchmarks
Submission Number: 3310