VERGE: Verification-Enhanced Generation of Multi-Hop Evaluation Datasets for Task-Specific RAG

Published: 29 Apr 2026, Last Modified: 29 Apr 2026 · Eval Eval @ ACL 2026 Poster · CC BY 4.0
Keywords: Question Answering, Resources and Evaluation, Retrieval-Augmented Language Models
Abstract: Despite the rapid growth of retrieval-augmented generation (RAG) systems in industry, existing evaluation datasets inadequately assess multi-hop reasoning when systems are deployed on custom enterprise knowledge bases, creating a critical gap between public benchmarks and real-world performance. We propose VERGE (VERification-enhanced GEneration), a two-stage pipeline for generating RAG evaluation data that (1) employs a Large Language Model (LLM) based verifier to enforce logical multi-hop reasoning and question-answer (QA) integrity criteria during question generation and (2) iteratively refines any questions that fail those criteria. Across 10,243 candidate questions spanning five domains, VERGE filters or refines low-quality items, yielding 2,258 verified multi-hop questions. Human evaluation confirms high verifier reliability (Cohen's $\kappa$ = 0.903) and rates the VERGE-generated dataset significantly higher on QA integrity and distractor quality than an existing method (p $<$ 0.001). We further propose a hierarchical taxonomy of RAG failure modes, dividing them into Information Processing and Knowledge Boundary errors. Our analysis reveals that the latter, particularly context-utilisation failures, dominate across all evaluated LLMs.
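The verify-then-refine loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `verify` and `refine` callables stand in for the LLM-based verifier and refiner, the criterion names and the round limit are hypothetical, and the actual VERGE criteria (multi-hop reasoning, QA integrity) are assumed to be encoded inside `verify`.

```python
def verge_filter(candidates, verify, refine, max_rounds=3):
    """Accept a candidate question only once it passes every criterion,
    allowing a bounded number of refinement attempts per candidate.

    verify(cand)       -> list of failed criterion names (empty = pass)
    refine(cand, fail) -> a revised candidate addressing the failures
    """
    accepted = []
    for cand in candidates:
        for _ in range(max_rounds):
            failed = verify(cand)
            if not failed:            # all criteria satisfied
                accepted.append(cand)
                break
            cand = refine(cand, failed)  # attempt to fix, then re-verify
        # candidates still failing after max_rounds are dropped
    return accepted
```

Bounding the number of refinement rounds keeps generation cost predictable while still discarding items that cannot be repaired, which matches the filter-or-refine behaviour the abstract reports.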
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Research Paper
Archival Status: Non-archival
Submission Number: 62