QAS: A Composite Query-Attributed Score for Evaluating Retrieval-Augmented Generation Systems

Published: 24 Nov 2025, Last Modified: 24 Nov 2025 · 5th Muslims in ML Workshop co-located with NeurIPS 2025 · CC BY 4.0
Keywords: Retrieval-Augmented Generation (RAG), Evaluation Metrics, Question Answering (QA), Multidimensional Assessment, Explainable AI
TL;DR: We introduce QAS, a reference-free, interpretable metric that decomposes RAG evaluation into five dimensions—grounding, faithfulness, coverage, efficiency, and relevance—aligning closely with human judgments across domains.
Abstract: Retrieval Augmented Generation (RAG) systems have advanced knowledge-grounded QA, but evaluation remains challenging due to competing demands of faithfulness to evidence, coverage of query-relevant information, and computational efficiency. We introduce QAS, a composite Query-Attributed Score for fine-grained, interpretable evaluation of RAG. QAS decomposes quality into five dimensions—grounding, retrieval coverage, answer faithfulness, context efficiency, and relevance—each computed with lightweight, task-agnostic metrics (token/entity attribution, n-gram overlap, factual consistency, redundancy penalties, and embedding similarity). A linear combination with tunable weights yields a unified score plus per-dimension diagnostics. Across five QA benchmarks (open-domain, biomedical, legal/regulatory, customer-support, and news), QAS aligns closely with human judgments at moderate cost. Ablations confirm each dimension’s necessity, establishing QAS as a transparent, practical framework for reliable RAG evaluation.
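The abstract describes QAS as a linear combination of five per-dimension scores with tunable weights. A minimal sketch of that aggregation step might look as follows; the dimension names come from the abstract, but the weight values, data structures, and function names here are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class QASWeights:
    """Tunable weights for the five QAS dimensions (assumed to sum to 1)."""
    grounding: float = 0.2
    coverage: float = 0.2
    faithfulness: float = 0.2
    efficiency: float = 0.2
    relevance: float = 0.2

def qas_score(dims: dict[str, float], w: QASWeights) -> float:
    """Combine per-dimension scores (each assumed in [0, 1]) into a
    single QAS value via a weighted linear sum, as the abstract describes."""
    return (w.grounding * dims["grounding"]
            + w.coverage * dims["coverage"]
            + w.faithfulness * dims["faithfulness"]
            + w.efficiency * dims["efficiency"]
            + w.relevance * dims["relevance"])

# Hypothetical per-dimension diagnostics for one RAG answer
dims = {"grounding": 0.9, "coverage": 0.8, "faithfulness": 0.85,
        "efficiency": 0.7, "relevance": 0.95}
score = qas_score(dims, QASWeights())  # → 0.84 with equal weights
```

Because the combination is linear, the per-dimension terms also serve directly as the diagnostics the abstract mentions: each weighted term shows how much a dimension contributed to the final score.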
Track: Track 2: ML by Muslim Authors
Submission Number: 36