From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-SA 4.0
TL;DR: We propose a method for developing an LLM-as-a-Judge metric which is specialized to a given test set, and show that this metric significantly outperforms non-specialized metrics.
Abstract: As LLMs continue to become more powerful and versatile, human evaluation has become intractable at scale and reliance on automatic metrics has become the norm. Recently, it has been shown that LLMs are themselves state-of-the-art evaluators for many tasks. These *Autoraters* are typically designed so that they generalize to new systems *and* test sets. In practice, however, evaluation is performed on a small set of fixed, canonical test sets, which are carefully curated to measure the capabilities of interest and are not changed frequently. In this work, we design a method which specializes a prompted Autorater to a given test set, by leveraging historical ratings on the test set to construct in-context learning (ICL) examples. We evaluate our *Specialist* method on the task of fine-grained machine translation evaluation, and show that it dramatically outperforms the state-of-the-art XCOMET metric by 54% and 119% on the WMT'23 and WMT'24 test sets, respectively. We perform extensive analyses to understand the representations learned by our Specialist metrics, and how variability in rater behavior affects their performance. We also verify the generalizability and robustness of our Specialist method across different numbers of ICL examples, LLM backbones, systems to evaluate, and evaluation tasks.
Lay Summary: As large language models (LLMs) continue to become more powerful and versatile, evaluating their performance across the wide range of tasks they are capable of becomes increasingly challenging. Traditionally, human evaluation has been considered the gold standard for evaluating LLM capabilities, but it has become intractable at scale and reliance on automatic metrics has become the norm. Automatic metrics can take many forms, from rule-based to model-based approaches. In the latter category, LLMs have themselves been shown to be state-of-the-art evaluators for many tasks. The key observation motivating this work is that, even though LLMs are typically evaluated using automatic metrics on standard test sets, the metrics and test sets are developed independently. This raises a crucial question: Can we design automatic metrics specifically to excel on the test sets we prioritize? We show that the answer is yes, by introducing the “Specialist” method for creating an LLM-based automatic metric. This method tailors an automatic metric to a specific test set by leveraging historical ratings for the same source segments in the test set as in-context learning examples. (These examples are used, along with the evaluation task instruction, to prompt the LLM-based automatic metric.) We evaluate our Specialist method on the task of machine translation evaluation and show that it outperforms the existing state-of-the-art automatic metric for this task by a large margin. This work advances research on how to develop automatic metrics which perform even better than humans for evaluation of LLM capabilities. This work also shows how a small human evaluation budget can be used to directly improve automatic metrics, rather than continually relying on humans every time a new LLM needs to be evaluated.
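To make the Specialist idea concrete, here is a minimal sketch (not the authors' implementation) of how such an ICL prompt could be assembled: for each source segment in the fixed test set, retrieve historical ratings of earlier system outputs for that same segment and prepend them, together with the task instruction, to the new output to be scored. The instruction text, data layout, and `judge_fn` callable are illustrative assumptions.

```python
from typing import Callable, Dict, List

# Assumed instruction text for a fine-grained MT evaluation task.
INSTRUCTION = (
    "You are a translation quality rater. Given a source segment and a "
    "candidate translation, list any error spans with their severity."
)


def build_specialist_prompt(
    source: str,
    candidate: str,
    historical_ratings: Dict[str, List[dict]],
) -> str:
    """Compose an ICL prompt from prior ratings on the *same* source segment."""
    examples = historical_ratings.get(source, [])
    parts = [INSTRUCTION, ""]
    for ex in examples:  # each ex: {"translation": ..., "rating": ...}
        parts += [
            f"Source: {source}",
            f"Translation: {ex['translation']}",
            f"Errors: {ex['rating']}",
            "",
        ]
    # Finally, append the new candidate translation to be rated.
    parts += [f"Source: {source}", f"Translation: {candidate}", "Errors:"]
    return "\n".join(parts)


def rate_with_specialist(
    source: str,
    candidate: str,
    historical_ratings: Dict[str, List[dict]],
    judge_fn: Callable[[str], str],  # wraps whichever LLM backbone is used
) -> str:
    prompt = build_specialist_prompt(source, candidate, historical_ratings)
    return judge_fn(prompt)
```

Because the test set is fixed, the historical ratings only need to be collected once and can then be reused to evaluate every new system on that set, which is what allows a small human-evaluation budget to keep paying off.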
Primary Area: General Machine Learning->Evaluation
Keywords: LLM-as-a-Judge, Autorater, Fine-grained evaluation, Machine Translation, In-context learning
Submission Number: 12337