TL;DR: We construct an unbiased LLM evaluation method that combines human and synthetic feedback to reduce human annotation cost.
Abstract: When developing new large language models (LLMs), a key step is evaluating their final performance, often by computing the win-rate against a reference model based on external feedback. Human feedback is the gold standard, particularly for capturing nuanced qualities like coherence, readability, and alignment with human expectations. However, human evaluations are costly—even for large tech companies—and when conducted with active users, they may negatively impact user experience.
A promising alternative is synthetic feedback, where evaluations are conducted by other large language models, including reward models. While this eliminates the need for costly human annotations, it introduces biases that may distort the evaluation process.
In this work, we propose a statistically principled framework that integrates human and synthetic feedback to reduce reliance on human annotations while keeping win-rate estimates unbiased.
Our experiments demonstrate a reduction in human annotations by up to 12.2% with an off-the-shelf synthetic evaluator and up to 24.8% with a finetuned variant. Apart from being generalizable, scalable, and free of hyper-parameter tuning, our method offers predictable annotation savings, which can be estimated based on data-dependent characteristics.
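The keywords ("variance reduction") and the repository name suggest the framework is a control-variates estimator: a small set of human labels is debiased and variance-reduced using cheap synthetic labels. The following is a minimal sketch of the generic control-variates win-rate estimator, not necessarily the authors' exact formulation; all function and variable names here are illustrative.

```python
import numpy as np

def cv_win_rate(human, synth_paired, synth_all):
    """Control-variates estimate of the win-rate against a reference model.

    human:        0/1 human win labels on a small annotated subset
    synth_paired: synthetic (LLM/reward-model) labels on that same subset
    synth_all:    synthetic labels on the full, cheap-to-label prompt set
    """
    human = np.asarray(human, dtype=float)
    synth_paired = np.asarray(synth_paired, dtype=float)
    synth_all = np.asarray(synth_all, dtype=float)

    # Near-optimal coefficient c* = Cov(h, s) / Var(s), fit on the paired data.
    cov = np.cov(human, synth_paired, ddof=1)
    if cov[1, 1] == 0.0:  # degenerate synthetic evaluator: fall back to human mean
        return human.mean()
    c = cov[0, 1] / cov[1, 1]

    # The correction term has expectation zero, so the estimator stays
    # (asymptotically) unbiased even when the synthetic evaluator is biased.
    return human.mean() - c * (synth_paired.mean() - synth_all.mean())
```

Under this construction the variance shrinks by roughly a factor of 1 − ρ², where ρ is the correlation between human and synthetic labels, which is consistent with the claim that annotation savings are predictable from data-dependent characteristics: a more correlated (e.g. finetuned) synthetic evaluator yields larger savings.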
Lay Summary: Evaluating how well a new AI language model performs—especially compared to existing models—usually requires human judgment. People are great at spotting whether responses sound natural, make sense, and align with what users expect. But relying on human feedback is expensive and time-consuming, even for big tech companies. It can also interfere with the experience of real users.
To address this, researchers often use other AI systems to provide feedback instead of humans. While this saves time and money, it can introduce hidden biases that affect how fairly the new model is judged.
In this work, we introduce a new approach that combines human and AI feedback in a statistically sound way. Our method reduces the need for human involvement while still producing reliable and unbiased results. We show that it can cut the amount of human feedback needed by up to 12% using an off-the-shelf AI evaluator, and by up to 25% after lightly fine-tuning it. This makes it easier and cheaper to evaluate AI models without sacrificing accuracy.
Link To Code: https://github.com/Zanette-Labs/control_variates_evaluation
Primary Area: Deep Learning->Large Language Models
Keywords: LLM evaluation, synthetic evaluation, variance reduction
Submission Number: 5100