Direct Judgement Preference Optimization

ICLR 2025 Conference Submission 12124 Authors

27 Sept 2024 (modified: 18 Nov 2024) · ICLR 2025 Conference Submission · Everyone · CC BY 4.0
Keywords: LLM-as-judge, generative judge, auto-evaluation
TL;DR: Using DPO, we train a family of high-performing generative LLM judge models capable of pairwise, single rating, and classification tasks
Abstract: Auto-evaluation is crucial for assessing response quality and offering feedback for model development. Recent studies have explored training large language models (LLMs) as generative judges to both evaluate model responses and generate natural language critiques. However, existing models have been trained almost exclusively with supervised fine-tuning (SFT), often only on a small number of datasets, resulting in poor generalization across different evaluation settings and tasks. In this paper, we investigate how learning from both positive and negative data with direct preference optimization (DPO) enhances the evaluation capabilities of LLM judges across three evaluation tasks: pairwise comparison, single rating, and binary classification. We achieve this by creating three forms of DPO data from a diverse collection of human and synthetic judgements on contemporary model outputs, with the goal of training our model to generate meaningful critiques, make accurate judgements, and understand what constitutes good and bad responses for a given user input. To demonstrate the effectiveness of our method, we train judge models at three sizes (8B, 12B, and 70B parameters) and conduct a comprehensive study over 13 benchmarks (7 pairwise, 4 single rating, and 2 classification), measuring agreement with human and GPT-4 annotations. Our models exhibit the best aggregate performance, with even our 8B model outperforming strong baselines such as GPT-4o and specialized judge models, including OffsetBias-8B, Auto-J-13B, Prometheus-2-8x7B, and Skywork-Critic-70B, on pairwise benchmarks. Further analysis shows that our judge model robustly counters biases such as position and length bias, flexibly adapts to practitioner-specified evaluation protocols, and provides helpful language feedback for improving downstream generator models.
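For readers unfamiliar with the training setup, the following is a minimal sketch of how a DPO objective over judgement data can be implemented. This is not the authors' released code: the loss is the standard DPO formulation applied to (chosen, rejected) judgement completions, and the example preference pair for the pairwise task, the field names, and the beta value are illustrative assumptions only.

```python
# Minimal sketch (assumed implementation, not the paper's code): standard DPO
# loss applied to judgement completions. Inputs are summed token
# log-probabilities of each full judgement (critique + verdict) under the
# trainable policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards are log-probability ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to prefer judgements reaching the correct verdict.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Hypothetical preference pair for the pairwise-comparison task: the chosen
# completion is a critique ending in the correct verdict, the rejected one
# ends in the wrong verdict. Single-rating and classification pairs would be
# constructed analogously; this format is illustrative, not from the paper.
example_pair = {
    "prompt": "Compare Response A and Response B to the user instruction ...",
    "chosen": "Response A answers the question directly and correctly ... Verdict: A",
    "rejected": "Response B is longer and therefore more helpful ... Verdict: B",
}
```

In practice, such pairs would be batched and scored with the policy and reference models to obtain the log-probability inputs above; the sketch only illustrates the objective applied to positive and negative judgement data as described in the abstract.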
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12124