Language Model Preference Evaluation with Multiple Weak Evaluators
Keywords: Large Language Models, Preference evaluation, Weak supervision
TL;DR: We introduce GED, a framework that enhances LLM-based evaluations by ensembling and denoising preference graphs, improving the consistency and reliability of LLM evaluators.
Abstract: Despite the remarkable success of Large Language Models (LLMs), evaluating the quality of their outputs with respect to *preference* remains a critical challenge. Existing works usually leverage an LLM as a judge to compare LLM outputs pairwise, yet such a model-based evaluator is a *weak evaluator* due to *conflicting preferences*, i.e., output A is preferred over B, B over C, but C over A, yielding contradictory evaluation results. To address this, we introduce GED (Preference Graph Ensemble and Denoise), a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs to obtain better, non-contradictory evaluation results. In particular, our method consists of two primary stages: aggregating evaluations into a unified graph and applying a denoising process that eliminates cyclic inconsistencies, ensuring a directed acyclic graph (DAG) structure. We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground-truth preference structure. Extensive experiments on ten benchmarks demonstrate GED's superiority in three applications: model ranking, response selection, and model alignment. Notably, GED combines small LLM evaluators (e.g., Llama3-8B, Mistral-7B, Qwen2-7B) to outperform stronger ones (e.g., Qwen2-72B), showcasing its effectiveness in enhancing evaluation reliability and improving model performance.
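To make the two stages concrete, the following is a minimal sketch, not the paper's implementation, of how ensembling and denoising preference graphs might look in Python. It assumes `networkx` is available; the function names (`ensemble_preferences`, `denoise_to_dag`, `rank`) are illustrative, and the cycle removal is a simple greedy approximation to the minimum feedback arc set, used only to show the idea of enforcing a DAG.

```python
# Sketch only: ensemble pairwise preferences from several weak evaluators
# into one weighted digraph, then greedily drop the lightest edge of each
# remaining cycle until the graph is a DAG, and read off a ranking by
# topological order. Names and the greedy heuristic are assumptions.
from collections import defaultdict
import networkx as nx

def ensemble_preferences(evaluator_votes):
    """evaluator_votes: list of lists of (winner, loser) pairs,
    one list per weak evaluator."""
    weight = defaultdict(int)
    for votes in evaluator_votes:
        for winner, loser in votes:
            weight[(winner, loser)] += 1
    G = nx.DiGraph()
    # Keep only the net preference direction between each pair.
    for (u, v), w in list(weight.items()):
        net = w - weight.get((v, u), 0)
        if net > 0:
            G.add_edge(u, v, weight=net)
    return G

def denoise_to_dag(G):
    """Greedy cycle removal: while a cycle exists, delete its
    lowest-weight edge (approximate feedback arc set)."""
    G = G.copy()
    while not nx.is_directed_acyclic_graph(G):
        cycle = nx.find_cycle(G)
        u, v = min(cycle, key=lambda e: G[e[0]][e[1]]["weight"])[:2]
        G.remove_edge(u, v)
    return G

def rank(G):
    return list(nx.topological_sort(G))

if __name__ == "__main__":
    votes = [
        [("A", "B"), ("B", "C")],  # evaluator 1
        [("A", "B"), ("C", "A")],  # evaluator 2 introduces a cycle A>B>C>A
        [("B", "C"), ("A", "B")],  # evaluator 3
    ]
    dag = denoise_to_dag(ensemble_preferences(votes))
    print(rank(dag))  # ['A', 'B', 'C']
```

In this toy run, the conflicting edge C→A carries the least ensemble support, so it is removed and the remaining DAG yields the consistent ranking A > B > C, mirroring the contradiction example in the abstract.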
Submission Number: 4