Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs

ACL ARR 2026 January Submission 7225 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large language models, rubric-based evaluation, adaptive rubrics, reinforcement learning from feedback, healthcare NLP
Abstract: Rubrics are essential for evaluating open-ended LLM responses, especially in safety-critical domains such as healthcare. However, creating high-quality, domain-specific rubrics typically requires significant human expertise and time, making rubric-based evaluation and training difficult to scale. In this work, we introduce Health-SCORE, a generalizable and scalable rubric-based evaluation framework that substantially reduces rubric development costs without sacrificing performance. We show that Health-SCORE provides two practical benefits beyond standalone evaluation: it can be used as a structured reward signal to guide reinforcement learning with safety-aware supervision, and it can be incorporated directly into prompts to improve response quality through in-context learning. Across open-ended healthcare tasks, Health-SCORE achieves evaluation quality comparable to human-created rubrics while significantly lowering development effort, making rubric-based evaluation and training more scalable.
Paper Type: Long
Research Area: Clinical and Biomedical Applications
Research Area Keywords: medical question answering, clinical dialogue systems, evaluation and metrics, automatic evaluation
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 7225