AutoEval Done Right: Using Synthetic Data for Model Evaluation

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This paper introduces statistical methods to combine a small amount of human-labeled data with large-scale AI-generated synthetic labels to produce more precise and statistically valid model evaluations.
Abstract: The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose, a process called autoevaluation. We propose efficient and statistically principled algorithms for autoevaluation that improve sample efficiency while remaining unbiased.
Lay Summary: This paper introduces a method to evaluate machine learning models more efficiently by combining a small set of human-annotated data with a larger set of AI-generated synthetic labels. The core idea is to use the human data to correct biases present in the synthetic labels, leveraging a statistical technique called prediction-powered inference. This approach is demonstrated across diverse applications, including ranking computer vision models, evaluating protein fitness predictors, and assessing large language models via pairwise comparisons from the Chatbot Arena. Results show that this method produces more accurate performance estimates and tighter confidence intervals than traditional evaluation techniques, allowing for more reliable model evaluation with reduced human effort.
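The following is a minimal sketch of the prediction-powered inference (PPI) idea described in the lay summary: a large-sample estimate computed from synthetic labels is debiased using a small set of paired human and synthetic labels. This is an illustrative reconstruction, not the authors' implementation (see the linked repository for that); the function name, variable names, and the normal-approximation confidence interval are assumptions made for this example.

```python
# Sketch of a prediction-powered mean estimator with a confidence interval.
# Assumes numpy and scipy are available; names here are illustrative only.
import numpy as np
from scipy.stats import norm

def ppi_mean_ci(y_human, yhat_human, yhat_synth, alpha=0.05):
    """Estimate the mean of a metric with a (1 - alpha) confidence interval.

    y_human     : human labels on the small annotated set (size n)
    yhat_human  : AI-generated labels on that same annotated set (size n)
    yhat_synth  : AI-generated labels on the large unannotated set (size N)
    """
    n, N = len(y_human), len(yhat_synth)
    # Large-sample estimate from synthetic labels, plus a bias correction
    # ("rectifier") computed on the paired human/synthetic labels.
    rectifier = np.mean(y_human - yhat_human)
    theta_pp = np.mean(yhat_synth) + rectifier
    # Variance combines the synthetic-label term and the rectifier term.
    var = np.var(yhat_synth, ddof=1) / N + np.var(y_human - yhat_human, ddof=1) / n
    z = norm.ppf(1 - alpha / 2)
    half_width = z * np.sqrt(var)
    return theta_pp, (theta_pp - half_width, theta_pp + half_width)

# Toy usage: 100 human-checked predictions and 10,000 AI-judged predictions
# (all data here is randomly generated purely for illustration).
rng = np.random.default_rng(0)
y_human = rng.binomial(1, 0.7, size=100).astype(float)
yhat_human = np.clip(y_human + rng.normal(0, 0.2, size=100), 0, 1)
yhat_synth = np.clip(rng.binomial(1, 0.7, size=10_000) + rng.normal(0, 0.2, size=10_000), 0, 1)
est, (lo, hi) = ppi_mean_ci(y_human, yhat_human, yhat_synth)
print(f"PPI estimate: {est:.3f}  95% CI: [{lo:.3f}, {hi:.3f}]")
```

Because the bias correction is estimated from human labels, the estimator stays valid even when the AI-generated labels are systematically off, while the large synthetic set keeps the confidence interval narrow.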
Link To Code: https://github.com/pierreboyeau/autoeval
Primary Area: General Machine Learning->Evaluation
Keywords: Prediction-powered inference, model evaluation, large language models, annotation, synthetic data, statistical inference
Submission Number: 9143