Keywords: Retrieval-Augmented Generation, RAG, Biomedical QA, Human Preference Alignment, LLM Evaluation
Abstract: Retrieval-Augmented Generation (RAG) systems are increasingly used for question answering (QA), yet evaluating their effectiveness remains challenging due to the lack of reliable reference resources and the frequent misalignment between automatic metrics and human judgments. Moreover, datasets containing real-world questions posed by domain experts are scarce. To address these gaps, we introduce novel biomedical question datasets concerning the association of wheat genes with specific traits, together with human preference annotations on LLM-generated answers. Methodologically, we explore how human preferences can be leveraged to calibrate off-the-shelf metrics for automatic evaluation, and we find that the calibrated metrics agree with human preferences more closely than baseline metrics on a held-out test set.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Retrieval-Augmented Generation, RAG, Question Answering, Human Preference Alignment, LLM Evaluation, Evaluation Metrics
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 7347