Keywords: Multi-hop RAG, Evaluation, RAG evaluation dataset generation
Abstract: Despite the rapid growth of retrieval-augmented generation (RAG) systems in industry, existing evaluation datasets inadequately assess multi-hop reasoning capabilities when deployed on custom enterprise knowledge bases, creating a critical evaluation gap between public benchmarks and real-world performance.
We propose VERGE (VERification-enhanced GEneration), a two-stage pipeline for generating RAG evaluation datasets that
(1) employs a Large Language Model (LLM) based verifier to enforce logical multi-hop reasoning and question-answer (QA) integrity criteria during question generation, and
(2) iteratively refines any questions that fail those criteria.
Across 10,243 candidate questions spanning five domains, VERGE filters or refines low-quality items, yielding 2,258 verified multi-hop questions.
Human evaluation confirms high verifier reliability (Cohen's $\kappa$ = 0.903) and rates the VERGE-generated dataset significantly higher on QA integrity and distractor quality than an existing method (p $<$ 0.001).
We further propose a hierarchical taxonomy of RAG failure modes, dividing them into Information Processing and Knowledge Boundary errors.
Our analysis reveals that the latter, particularly context utilisation failures, dominate across all evaluated LLMs.
Our methodology provides both a practical RAG evaluation suite for practitioners and a rigorous foundation for advancing multi-hop reasoning research.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Question Answering, Resources and Evaluation, Retrieval-Augmented Language Models
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 1441