Efficient model performance evaluation using a combination of expert and crowd-sourced labels

Published: 03 Feb 2026, Last Modified: 03 Feb 2026. AISTATS 2026 Poster. License: CC BY 4.0
TL;DR: We develop a Bayesian model for estimating the performance of models, combining low-quality crowd-sourced labels with expert labels to reduce the bias and variance of estimates
Abstract: As models, particularly large language models (LLMs), are deployed on increasingly challenging tasks, correctly evaluating their performance is growing in importance and difficulty. Expert human labelers are high-quality but scarce and expensive, while crowd-sourced labels are cheaper at scale but lower in quality. This paper proposes Maven (Model and Voter EvaluatioN), a hierarchical Bayesian model that combines these two label sources to produce estimates of model performance on binary tasks that are less biased than using crowd-sourced labels alone and have lower variance than using high-quality labels alone. By modeling the ranking induced by model predictions instead of their raw values, our approach is robust to a range of prediction distributions and achieves constant inference time regardless of dataset size. The Maven model enables the imputation of missing high-quality labels, allowing the estimation of a comprehensive suite of performance metrics. We validate our approach on both simulated data and production models at a major technology company. In one production model, Maven's estimate of model performance achieved equal variance and indistinguishable point estimates compared to the expert-only estimate, while reducing labeling cost by 42%. Our results show that Maven is a practical solution for cost-effective, high-quality model evaluation at scale.
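The bias/variance tradeoff the abstract describes can be illustrated with a toy simulation. This sketch is not the Maven model (which is hierarchical Bayesian and rank-based); it only shows why an expert-only estimate is unbiased but noisy, why a raw crowd estimate is precise but biased toward chance, and how a noise-corrected crowd estimate can recover both properties. All numbers (`TRUE_ACC`, `CROWD_NOISE`, sample sizes) are hypothetical choices for the illustration, not values from the paper.

```python
import random

random.seed(0)

# Hypothetical setup: a model with true accuracy 0.8 on a binary task.
# Experts label a small subset perfectly; crowd workers label every item
# but flip the true correct/incorrect judgment with probability 0.3.
N = 10000            # total evaluation items
N_EXPERT = 200       # items receiving an expert label
CROWD_NOISE = 0.3    # probability a crowd label flips the truth
TRUE_ACC = 0.8       # true model accuracy we are trying to estimate

# Whether the model is correct on each item.
correct = [random.random() < TRUE_ACC for _ in range(N)]

# Expert-only estimate: unbiased, but high variance (only 200 samples).
expert_est = sum(correct[:N_EXPERT]) / N_EXPERT

# Crowd labels: each one flips the truth with probability CROWD_NOISE.
crowd = [c if random.random() > CROWD_NOISE else not c for c in correct]

# Raw crowd estimate: low variance (10,000 samples) but biased toward 0.5,
# since E[observed] = acc * (1 - p) + (1 - acc) * p for flip rate p.
crowd_raw = sum(crowd) / N

# Debiased crowd estimate: invert the noise model,
#   acc = (observed - p) / (1 - 2p),
# a simplified stand-in for jointly inferring labeler quality and
# model performance as the paper's Bayesian model does.
crowd_debiased = (crowd_raw - CROWD_NOISE) / (1 - 2 * CROWD_NOISE)

print(f"expert-only:     {expert_est:.3f}")
print(f"crowd raw:       {crowd_raw:.3f}")
print(f"crowd debiased:  {crowd_debiased:.3f}")
```

In practice the flip rate is unknown and varies per labeler, which is why Maven infers labeler quality jointly with model performance from the overlap between expert- and crowd-labeled items rather than assuming a known noise rate as above.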
Submission Number: 2023