Closing the Evaluation Gap: Ensembling LLM-Judges Generates More Reliable Inference-Time Reference-Free Critiques
Abstract: LLM-as-a-Judge enables efficient, scalable natural-language evaluation of complex generated outputs, such as code, without requiring a ground-truth reference. These evaluation protocols have become a crucial component of inference-time refinement approaches such as prompt optimization. However, an important question is whether a pre-trained LLM can generate a reliable evaluation of the output.
In this work, we derive a result showing that a single LLM-based judge is insufficient for generating an optimal critique. We then provide a solution by demonstrating that aggregating multiple LLM-generated evaluations better models the optimal critique. We empirically demonstrate the merits of ensembling multiple LLM judges through prompt-optimization experiments for code generation: ensembling judges yields up to a ~9% increase in solved coding problems over using a single judge. We also perform ablations over different aggregation methods and diverse evaluation instructions, highlighting that the design of LLM-judge ensembles is non-trivial and motivating further research. Anonymized code: https://anonymous.4open.science/r/ensemble_eval-891B/ReadMe.md
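To make the ensembling idea concrete, here is a minimal sketch of score-level aggregation of multiple judges. This is an illustrative example, not the paper's exact protocol: the judge functions, the mean aggregation, and all names below are hypothetical stand-ins for LLM judges prompted with different evaluation instructions.

```python
# Hypothetical sketch: aggregate several reference-free judge scores into one
# ensemble critique. Real judges would be LLM calls; toy functions stand in here.
from statistics import mean
from typing import Callable, List

Judge = Callable[[str], float]  # a judge maps a candidate output to a score in [0, 1]

def ensemble_score(candidate: str, judges: List[Judge]) -> float:
    """Aggregate individual judge scores (mean aggregation is one simple choice)."""
    return mean(judge(candidate) for judge in judges)

# Toy stand-ins for LLM judges with diverse evaluation instructions.
def judge_correctness(candidate: str) -> float:
    return 0.8 if "return" in candidate else 0.2

def judge_readability(candidate: str) -> float:
    return 0.6 if len(candidate) < 200 else 0.4

if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b"
    print(ensemble_score(snippet, [judge_correctness, judge_readability]))
```

Other aggregation choices (e.g., majority vote over discrete verdicts or weighted averaging) can be swapped in for the mean, which is part of the design space the ablations examine.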
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: evaluation, prompting, optimization methods
Contribution Types: Theory
Languages Studied: English
Submission Number: 2248