HREF: Human Response-Guided Evaluation of Instruction Following in Language Models

ACL ARR 2025 February Submission 6923 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Evaluating instruction following in language models has relied heavily on using LLMs as judges. In this work, we reevaluate common choices in the automatic evaluation setup, such as the choice of judge model and prompting strategy, on a wide range of instruction-following tasks. We experiment with methods that leverage human-written responses and observe that they improve the reliability of automatic evaluations across these tasks, yielding up to a 3.2% improvement in agreement with human judges. We also show that human-written responses offer a perspective orthogonal to model-generated responses on following instructions and should be used as additional context when comparing model responses. Based on these observations, we develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF), which contains 4,258 human-written instructions spanning 11 task categories. To prevent test-set leakage, we keep a portion of our evaluation data hidden. We publicly release a separate development set and code to evaluate models on it, and we host a live leaderboard for publicly available models on our hidden evaluation set.
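The abstract describes adding a human-written reference response as extra context for an LLM judge when comparing two model responses. Below is a minimal sketch of what such a pairwise judge prompt could look like; it is not the paper's released code, and the template text, function name, and example inputs are hypothetical.

```python
# Minimal sketch (assumption, not HREF's actual prompt): a pairwise
# LLM-as-judge prompt that includes a human-written reference response
# as additional context alongside the two model responses.

JUDGE_TEMPLATE = """You are comparing two responses to an instruction.

Instruction:
{instruction}

Human-written reference response (additional context):
{human_response}

Response A:
{response_a}

Response B:
{response_b}

Which response follows the instruction better? Answer with "A", "B", or "tie"."""


def build_judge_prompt(instruction: str,
                       human_response: str,
                       response_a: str,
                       response_b: str) -> str:
    """Fill the pairwise judge template, including the human reference."""
    return JUDGE_TEMPLATE.format(
        instruction=instruction,
        human_response=human_response,
        response_a=response_a,
        response_b=response_b,
    )


if __name__ == "__main__":
    # Hypothetical usage: the resulting string would be sent to a judge LLM.
    print(build_judge_prompt(
        instruction="Summarize the paragraph in one sentence.",
        human_response="A one-sentence human-written summary.",
        response_a="Model A's summary...",
        response_b="Model B's summary...",
    ))
```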
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, NLP dataset, automatic evaluation of datasets
Contribution Types: NLP engineering experiment, Reproduction study, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 6923