Efficient Model Performance Evaluation Using a Combination of Expert and Crowd-sourced Labels
TL;DR: We develop a Bayesian model that combines low-quality crowd-sourced labels with scarce expert labels to estimate model performance with lower bias and variance.
Abstract: As models, particularly large language models (LLMs), are deployed on increasingly challenging tasks, correctly evaluating their performance is growing in importance and difficulty. Expert human labelers are high-quality but scarce and resource-intensive to obtain, while crowd-sourced labels are more readily accessible at scale but lower in quality. We propose Maven (Model And Voter EvaluatioN), a hierarchical Bayesian model that combines these two label sources to produce model performance estimates on binary tasks that are less biased than using crowd-sourced labels alone and have lower variance than using expert labels alone. By modeling the ranking of model scores, Maven is robust to a range of prediction distributions and achieves constant inference time regardless of dataset size.
We validate our approach on both simulated and real-world data, and deploy it to evaluate production models at Meta.
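The abstract's core idea, that biased-but-plentiful crowd labels can be debiased and pooled with unbiased-but-scarce expert labels, can be illustrated with a minimal sketch. This is not the Maven model itself (whose hierarchical structure is not specified here); it assumes a simple symmetric noise model in which each crowd judgment is flipped with a known error rate `crowd_err`, and combines the two estimators by inverse-variance weighting. All variable names and the noise rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the true model accuracy we want to estimate.
true_acc = 0.8
crowd_err = 0.2            # assumed known crowd-label error rate
n_expert, n_crowd = 50, 5000

# Expert labels: unbiased but scarce draws of "did the model get it right?".
expert_correct = rng.random(n_expert) < true_acc

# Crowd labels: the same signal, but each judgment is independently flipped
# with probability crowd_err, biasing the raw agreement rate toward 0.5.
model_correct = rng.random(n_crowd) < true_acc
flips = rng.random(n_crowd) < crowd_err
crowd_correct = model_correct ^ flips

# Naive crowd estimate is biased: E[p_obs] = p*(1 - e) + (1 - p)*e.
p_obs = crowd_correct.mean()

# Debias by inverting the noise model (valid when e < 0.5).
p_crowd = (p_obs - crowd_err) / (1 - 2 * crowd_err)

# Expert estimate: unbiased, but high variance at small n_expert.
p_expert = expert_correct.mean()

# Combine the two estimators via inverse-variance weighting.
var_expert = p_expert * (1 - p_expert) / n_expert
var_crowd = p_obs * (1 - p_obs) / n_crowd / (1 - 2 * crowd_err) ** 2
p_combined = (p_expert / var_expert + p_crowd / var_crowd) / (
    1 / var_expert + 1 / var_crowd
)
```

The combined estimator inherits the crowd estimator's low variance while the debiasing step removes its systematic error; a full Bayesian treatment would additionally place priors over the unknown crowd error rates rather than assuming them known.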
Submission Number: 2023