Human Preference Guided Evaluation of RAG-based QA Systems

ACL ARR 2026 January Submission 7347 Authors

06 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Retrieval-Augmented Generation, RAG, Biomedical QA, Human Preference Alignment, LLM Evaluation
Abstract: Retrieval-Augmented Generation (RAG) systems are increasingly used for question answering (QA) tasks. However, evaluating their effectiveness remains challenging: reliable reference resources are scarce, and automatic metrics frequently disagree with human judgments. Moreover, few datasets contain real-world questions posed by domain experts. To address these gaps, we introduce novel biomedical question datasets concerning the association of wheat genes with specific traits, along with human preference annotations on LLM-generated answers. Methodologically, we explore how to leverage human preferences to calibrate off-the-shelf metrics for automatic evaluation, and we find that the calibrated metrics achieve higher agreement with human preferences than baseline metrics on the held-out test set.
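The calibration idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual method: it assumes calibration means learning a weighted combination of several automatic metric scores via logistic regression on pairwise human preferences (a Bradley-Terry-style model). All function names and data here are hypothetical.

```python
import math
import random

def calibrate(pairs, labels, k, lr=0.5, epochs=2000):
    """Learn weights w over k metrics so that sigmoid(w . (sA - sB))
    matches human preference labels (1 = answer A preferred, 0 = B).
    `pairs` is a list of (scores_A, scores_B) tuples, each a k-vector
    of automatic metric scores for the two candidate answers."""
    w = [0.0] * k
    for _ in range(epochs):
        grad = [0.0] * k
        for (sa, sb), y in zip(pairs, labels):
            # Difference of metric scores between the two answers.
            d = [a - b for a, b in zip(sa, sb)]
            # Predicted probability that A is preferred.
            p = 1.0 / (1.0 + math.exp(-sum(wi * di for wi, di in zip(w, d))))
            # Gradient of the cross-entropy loss w.r.t. each weight.
            for i in range(k):
                grad[i] += (p - y) * d[i]
        w = [wi - lr * gi / len(pairs) for wi, gi in zip(w, grad)]
    return w

def agreement(w, pairs, labels):
    """Fraction of pairs where the calibrated score agrees with the human label."""
    correct = 0
    for (sa, sb), y in zip(pairs, labels):
        score = sum(wi * (a - b) for wi, (a, b) in zip(w, zip(sa, sb)))
        correct += int((score > 0) == (y == 1))
    return correct / len(pairs)
```

On held-out pairs, `agreement` plays the role of the metric/human agreement the abstract reports: an uncalibrated metric corresponds to a fixed one-hot weight vector, while `calibrate` searches for the weighting that best matches annotators.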
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Retrieval-Augmented Generation, RAG, Question Answering, Human Preference Alignment, LLM Evaluation, Evaluation Metrics
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 7347