Direct Judgement Preference Optimization

ICLR 2025 Conference Submission 12124 Authors

27 Sept 2024 (modified: 18 Nov 2024) · ICLR 2025 Conference Submission · Everyone · CC BY 4.0
Keywords: LLM-as-judge, generative judge, auto-evaluation
TL;DR: Using DPO, we train a family of high-performing generative LLM judge models capable of pairwise, single rating, and classification tasks
Abstract: Auto-evaluation is crucial for assessing response quality and offering feedback for model development. Recent studies have explored training large language models (LLMs) as generative judges to both evaluate model responses and generate natural language critiques. However, existing models have been trained almost exclusively with supervised fine-tuning (SFT), often only on a small number of datasets, resulting in poor generalization across different evaluation settings and tasks. In this paper, we investigate how learning from both positive and negative data with direct preference optimization (DPO) enhances the evaluation capabilities of LLM judges across three evaluation tasks: pairwise comparison, single rating, and binary classification. We achieve this by creating three forms of DPO data from a diverse collection of human and synthetic judgements on contemporary model outputs, with the goal of training our model to generate meaningful critiques, make accurate judgements, and understand what constitutes good and bad responses for a given user input. To demonstrate the effectiveness of our method, we train judge models at three sizes (8B, 12B, and 70B parameters) and conduct a comprehensive study over 13 benchmarks (7 pairwise, 4 single rating, and 2 classification), measuring agreement with human and GPT-4 annotations. Our models exhibit the best aggregate performance, with even our 8B model outperforming strong baselines such as GPT-4o and specialized judge models, including OffsetBias-8B, Auto-J-13B, Prometheus-2-8x7B, and Skywork-Critic-70B, on pairwise benchmarks. Further analysis shows that our judge model robustly counters biases such as position and length bias, flexibly adapts to practitioner-specified evaluation protocols, and provides helpful language feedback for improving downstream generator models.
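For readers unfamiliar with the training setup, the following is a minimal sketch of how a DPO objective over judgement data can be implemented. This is not the authors' released code: the loss is the standard DPO formulation applied to (chosen, rejected) judgement completions, and the example preference pair for the pairwise task, the field names, and the beta value are illustrative assumptions only.

```python
# Minimal sketch (assumed implementation, not the paper's code): standard DPO
# loss applied to judgement completions. Inputs are summed token
# log-probabilities of each full judgement (critique + verdict) under the
# trainable policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards are log-probability ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to prefer judgements reaching the correct verdict.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Hypothetical preference pair for the pairwise-comparison task: the chosen
# completion is a critique ending in the correct verdict, the rejected one
# ends in the wrong verdict. Single-rating and classification pairs would be
# constructed analogously; this format is illustrative, not from the paper.
example_pair = {
    "prompt": "Compare Response A and Response B to the user instruction ...",
    "chosen": "Response A answers the question directly and correctly ... Verdict: A",
    "rejected": "Response B is longer and therefore more helpful ... Verdict: B",
}
```

In practice, such pairs would be batched and scored with the policy and reference models to obtain the log-probability inputs above; the sketch only illustrates the objective applied to positive and negative judgement data as described in the abstract.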
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12124