Advancing Reliable and Explainable Evaluation of Domain-Specific Retrieval-Augmented Language Models

Anonymous

16 Feb 2024, ACL ARR 2024 February Blind Submission, Readers: Everyone

Abstract: The advent of Large Language Models (LLMs) has significantly advanced the capabilities of Retrieval-Augmented Generation (RAG) systems, which are now extensively researched and deployed across industries for domain-specific knowledge querying. However, evaluating these systems presents unique challenges: domain-specific queries and corresponding ground truths are scarce, and there is no systematic approach to diagnosing whether failure cases stem from knowledge deficits or from issues of system robustness. To address these challenges, we introduce an evaluation framework comprising two key elements: 1) a data generation process that leverages relational databases and LLMs to efficiently produce query-answer pairs at scale, separating query logic from linguistic variation to make debugging easier; and 2) an explainable evaluation protocol equipped with a novel metric that assesses the extent of knowledge comprehension and system robustness in both retrieval and language modeling contexts. Importantly, our empirical findings highlight the shortcomings of prevalent reference-free evaluation methods, positioning our reliable reference-based evaluation protocol as a valuable adjunct.

Paper Type: long

Research Area: Resources and Evaluation

Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources

Languages Studied: English
