Keywords: Retrieval-Augmented Generation, RAG, Biomedical QA, Human Preference Alignment, LLM Evaluation
Abstract: Retrieval-Augmented Generation (RAG) systems are increasingly used for question answering (QA), yet evaluating their effectiveness remains challenging due to the lack of reliable reference resources and the frequent misalignment between automatic metrics and human judgments. Moreover, datasets containing real-world questions posed by domain experts are scarce. To address these gaps, we introduce novel biomedical question datasets concerning the association of wheat genes with specific traits, together with human preference annotations on LLM-generated answers. Methodologically, we explore how human preferences can be leveraged to calibrate off-the-shelf metrics for automatic evaluation, and we find that the calibrated metrics agree with human preferences more closely than baseline metrics on a held-out test set.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Retrieval-Augmented Generation, RAG, Question Answering, Human Preference Alignment, LLM Evaluation, Evaluation Metrics
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 7347