Direct Judgement Preference Optimization

ACL ARR 2025 February Submission 1539 Authors

13 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: To meet the increasing need for timely and accurate evaluation of large language model (LLM) responses, training LLMs themselves to evaluate and critique other models' responses has emerged as a popular paradigm known as LLM-as-judge. However, existing judge models are largely trained with supervised finetuning (SFT) to perform a limited range of evaluation tasks. In this paper, we investigate how learning from paired preference data via direct preference optimization (DPO) enhances the evaluation capabilities of judge models across three evaluation tasks: pairwise comparison, single rating, and binary classification. Using four training tasks, including a novel response deduction task, we form three types of DPO preference pairs targeting different aspects of evaluation: generating meaningful critiques, making accurate judgements, and understanding what comprises good and bad responses. To demonstrate the effectiveness of our method, we train judge models at three sizes (8B, 12B, and 70B parameters) and evaluate on a comprehensive suite of 13 benchmarks (7 pairwise, 4 single rating, and 2 classification), measuring agreement with humans and GPT-4. Our models achieve the best aggregate performance, with even our 8B model outperforming GPT-4o and Skywork-Critic-70B on pairwise benchmarks. Further analysis shows that our judge models robustly counter biases such as position and length bias, and produce factual and actionable critiques.
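As background for the training objective referenced in the abstract, below is a minimal sketch of the standard DPO loss (Rafailov et al., 2023) applied to judge-style preference pairs. The symbols π_θ (policy), π_ref (frozen reference model), β (scaling temperature), and (x, y_w, y_l) (evaluation prompt with preferred and dispreferred judge outputs) are the usual DPO notation and are assumptions for illustration; the paper's exact formulation and pair construction may differ.

```latex
% Standard DPO objective (Rafailov et al., 2023), written for judge-style
% preference pairs: x is the evaluation prompt, y_w the preferred judge
% output (e.g., a sound critique plus correct judgement), y_l the
% dispreferred one. These roles are illustrative, not the paper's exact setup.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        \;-\;
        \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Each of the three preference-pair types described in the abstract would plug into an objective of this form simply as a different choice of (y_w, y_l) for a given evaluation prompt x.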
Paper Type: Long
Research Area: Generation
Research Area Keywords: automatic evaluation
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 1539