Advancing Reliable and Explainable Evaluation of Domain-Specific Retrieval-Augmented Language Models
Abstract: The advent of Large Language Models (LLMs) has significantly advanced the capabilities of Retrieval-Augmented Generation (RAG) systems, which are now widely researched and deployed across industries for domain-specific knowledge querying. However, evaluating these systems presents unique challenges: domain-specific queries and corresponding ground truths are scarce, and there is no systematic approach to diagnosing whether a failure case stems from a knowledge deficit or from a lack of system robustness. To address these challenges, we introduce an evaluation framework comprising two key elements: 1) a data generation process that leverages relational databases and LLMs to efficiently produce query-answer pairs at scale, separating query logic from linguistic variation to make debugging easier; and 2) an explainable evaluation protocol equipped with a novel metric that assesses the extent of knowledge comprehension and system robustness in both retrieval and language modeling. Importantly, our empirical findings highlight the shortcomings of prevalent reference-free evaluation methods, positioning our reliable reference-based evaluation protocol as a valuable complement.
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
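Illustrative sketch: the abstract describes generating query-answer pairs from a relational database while keeping query logic separate from linguistic variation. The minimal Python example below shows one way that separation could look. It is not the authors' implementation. The table schema, the names `SQL_TEMPLATE`, `PARAPHRASES`, and `generate_pairs`, and the sample rows are all hypothetical, and the hand-written paraphrases stand in for variations that an LLM would produce.

```python
import sqlite3

# Hypothetical toy schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT, warranty_months INTEGER)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [("RouterX", 24), ("SwitchY", 36)],
)

# Query logic: a single SQL template fixes what is asked and yields the ground truth.
SQL_TEMPLATE = "SELECT warranty_months FROM product WHERE name = ?"

# Linguistic variation: hand-written stand-ins for LLM-generated paraphrases
# (assumption: in practice an LLM would produce these surface forms).
PARAPHRASES = [
    "What is the warranty period of {name} in months?",
    "How many months of warranty does {name} come with?",
]


def generate_pairs(conn):
    """Return one query-answer pair per (entity, paraphrase) combination."""
    pairs = []
    names = [row[0] for row in conn.execute("SELECT name FROM product ORDER BY name")]
    for name in names:
        # The database, not a language model, supplies the reference answer.
        (answer,) = conn.execute(SQL_TEMPLATE, (name,)).fetchone()
        for variant_id, template in enumerate(PARAPHRASES):
            pairs.append({
                "logic_id": "warranty_lookup",  # shared by all paraphrases of this logic
                "variant_id": variant_id,
                "question": template.format(name=name),
                "answer": str(answer),
            })
    return pairs


pairs = generate_pairs(conn)
```

Because every pair carries both a `logic_id` and a `variant_id`, a failure that appears for only some paraphrases of the same logic would point toward a robustness issue, while a failure across all paraphrases would point toward a knowledge deficit. This grouping is an assumption about how such data could be used, not a description of the paper's metric.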