Closing the Evaluation Gap: Ensembling LLM-Judges Generates More Reliable Inference-Time Reference-Free Critiques
Abstract: LLM-as-a-Judge enables efficient, scalable natural-language evaluation of complex generated outputs, such as code, without requiring a ground-truth reference. These evaluation protocols have become a crucial component of inference-time refinement approaches such as prompt optimization. However, an important question is whether a pre-trained LLM can generate a reliable evaluation of the output.
In this work, we derive a result showing that a single LLM-based judge is insufficient for generating an optimal critique. We then provide a solution by demonstrating that aggregating multiple LLM-generated evaluations better models the optimal critique. We empirically demonstrate the merits of ensembling multiple LLM judges through prompt-optimization experiments for code generation: ensembling judges yields up to a ~9% increase in solved coding problems over using a single judge. We also perform ablations over different aggregation methods and diverse evaluation instructions, highlighting that the design of LLM-judge ensembles is non-trivial and motivating further research. Anonymized code: https://anonymous.4open.science/r/ensemble_eval-891B/ReadMe.md
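To make the ensembling idea concrete, here is a minimal sketch of score-level aggregation of multiple judges. This is an illustrative example, not the paper's exact protocol: the judge functions, the mean aggregation, and all names below are hypothetical stand-ins for LLM judges prompted with different evaluation instructions.

```python
# Hypothetical sketch: aggregate several reference-free judge scores into one
# ensemble critique. Real judges would be LLM calls; toy functions stand in here.
from statistics import mean
from typing import Callable, List

Judge = Callable[[str], float]  # a judge maps a candidate output to a score in [0, 1]

def ensemble_score(candidate: str, judges: List[Judge]) -> float:
    """Aggregate individual judge scores (mean aggregation is one simple choice)."""
    return mean(judge(candidate) for judge in judges)

# Toy stand-ins for LLM judges with diverse evaluation instructions.
def judge_correctness(candidate: str) -> float:
    return 0.8 if "return" in candidate else 0.2

def judge_readability(candidate: str) -> float:
    return 0.6 if len(candidate) < 200 else 0.4

if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b"
    print(ensemble_score(snippet, [judge_correctness, judge_readability]))
```

Other aggregation choices (e.g., majority vote over discrete verdicts or weighted averaging) can be swapped in for the mean, which is part of the design space the ablations examine.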
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: evaluation, prompting, optimization methods
Contribution Types: Theory
Languages Studied: English
Submission Number: 2248