Efficient model performance evaluation using a combination of expert and crowd-sourced labels

Published: 03 Feb 2026, Last Modified: 03 Feb 2026. AISTATS 2026 Poster. License: CC BY 4.0
TL;DR: We develop a Bayesian model for estimating the performance of models, combining low-quality crowd-sourced labels with expert labels to reduce the bias and variance of estimates
Abstract: As models, particularly large language models (LLMs), are deployed on increasingly challenging tasks, correctly evaluating their performance is growing in importance and difficulty. Expert human labelers are high-quality but scarce and expensive, while crowd-sourced labels are cheaper at scale but lower in quality. This paper proposes Maven (Model and Voter EvaluatioN), a hierarchical Bayesian model that combines these two label sources to produce estimates of model performance on binary tasks that are less biased than using crowd-sourced labels alone and have lower variance than using high-quality labels alone. By modeling the ranking induced by model predictions instead of their raw values, our approach is robust to a range of prediction distributions and achieves constant inference time regardless of dataset size. The Maven model enables the imputation of missing high-quality labels, allowing the estimation of a comprehensive suite of performance metrics. We validate our approach on both simulated data and production models at a major technology company. In one production model, Maven's estimate of model performance achieved equal variance and indistinguishable point estimates compared to the expert-only estimate, while reducing labeling cost by 42%. Our results show that Maven is a practical solution for cost-effective, high-quality model evaluation at scale.
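The bias/variance tradeoff the abstract describes can be illustrated with a toy simulation. This sketch is not the Maven model (which is hierarchical Bayesian and rank-based); it only shows why an expert-only estimate is unbiased but noisy, why a raw crowd estimate is precise but biased toward chance, and how a noise-corrected crowd estimate can recover both properties. All numbers (`TRUE_ACC`, `CROWD_NOISE`, sample sizes) are hypothetical choices for the illustration, not values from the paper.

```python
import random

random.seed(0)

# Hypothetical setup: a model with true accuracy 0.8 on a binary task.
# Experts label a small subset perfectly; crowd workers label every item
# but flip the true correct/incorrect judgment with probability 0.3.
N = 10000            # total evaluation items
N_EXPERT = 200       # items receiving an expert label
CROWD_NOISE = 0.3    # probability a crowd label flips the truth
TRUE_ACC = 0.8       # true model accuracy we are trying to estimate

# Whether the model is correct on each item.
correct = [random.random() < TRUE_ACC for _ in range(N)]

# Expert-only estimate: unbiased, but high variance (only 200 samples).
expert_est = sum(correct[:N_EXPERT]) / N_EXPERT

# Crowd labels: each one flips the truth with probability CROWD_NOISE.
crowd = [c if random.random() > CROWD_NOISE else not c for c in correct]

# Raw crowd estimate: low variance (10,000 samples) but biased toward 0.5,
# since E[observed] = acc * (1 - p) + (1 - acc) * p for flip rate p.
crowd_raw = sum(crowd) / N

# Debiased crowd estimate: invert the noise model,
#   acc = (observed - p) / (1 - 2p),
# a simplified stand-in for jointly inferring labeler quality and
# model performance as the paper's Bayesian model does.
crowd_debiased = (crowd_raw - CROWD_NOISE) / (1 - 2 * CROWD_NOISE)

print(f"expert-only:     {expert_est:.3f}")
print(f"crowd raw:       {crowd_raw:.3f}")
print(f"crowd debiased:  {crowd_debiased:.3f}")
```

In practice the flip rate is unknown and varies per labeler, which is why Maven infers labeler quality jointly with model performance from the overlap between expert- and crowd-labeled items rather than assuming a known noise rate as above.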
Submission Number: 2023