Evaluating multiple models using labeled and unlabeled data

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: evaluation, mixture models, semi-supervised learning
TL;DR: We propose a new evaluation method that makes use of three sources of information (unlabeled data, multiple classifiers, and probabilistic classifier scores) to produce more accurate performance estimates than prior work.
Abstract: It is difficult to evaluate machine learning classifiers without large labeled datasets, which are often unavailable. In contrast, unlabeled data is plentiful, but not easily used for evaluation. Here, we introduce Semi-Supervised Model Evaluation (SSME), a method that uses both labeled and unlabeled data to evaluate machine learning classifiers. The key idea is to estimate the joint distribution of ground truth labels and classifier scores using a semi-supervised mixture model. The semi-supervised mixture model allows SSME to learn from three sources of information: unlabeled data, multiple classifiers, and probabilistic classifier scores. Once fit, the mixture model enables estimation of any metric that is a function of classifier scores and ground truth labels (e.g., accuracy or AUC). We derive theoretical bounds on the error of these estimates, showing that estimation error decreases with the number of classifiers and the amount of unlabeled data. We present experiments in four domains where obtaining large labeled datasets is often impractical: healthcare, content moderation, molecular property prediction, and text classification. Our results demonstrate that SSME estimates performance more accurately than do competing methods, reducing error by 5.1x relative to using labeled data alone and 2.4x relative to the next best method.
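The core idea, fitting a semi-supervised mixture over classifier scores and then reading any metric off the fitted posterior, can be illustrated with a minimal sketch. Note the assumptions: this uses simulated data, a two-component Gaussian model on classifier logits fit by EM, and hypothetical variable names; the paper's actual parameterization and estimator may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Simulated setup (hypothetical, not the paper's experiments) ---
n, k = 2000, 3                      # examples, classifiers
y = rng.integers(0, 2, n)           # ground-truth binary labels
# each classifier emits a logit centred on the true class, plus noise
logits = (2.0 * y - 1.0)[:, None] + rng.normal(0.0, 1.0, (n, k))
labeled = np.zeros(n, bool)
labeled[:50] = True                 # only 50 labels are observed

# --- Semi-supervised EM: 2-component Gaussian mixture over logits ---
pi = 0.5                                       # P(y = 1)
mu = np.array([[-0.5] * k, [0.5] * k], float)  # per-class, per-classifier means
sd = np.ones((2, k))
for _ in range(100):
    # E-step: r[i] = P(y_i = 1 | classifier scores)
    def loglik(c):
        return np.sum(-0.5 * ((logits - mu[c]) / sd[c]) ** 2 - np.log(sd[c]), axis=1)
    r = 1.0 / (1.0 + (1 - pi) / pi * np.exp(loglik(0) - loglik(1)))
    r[labeled] = y[labeled]                    # labeled points pin their responsibilities
    # M-step: weighted means and standard deviations per component
    pi = r.mean()
    for c, w in [(0, 1 - r), (1, r)]:
        W = w[:, None]
        mu[c] = (W * logits).sum(0) / W.sum()
        sd[c] = np.maximum(np.sqrt((W * (logits - mu[c]) ** 2).sum(0) / W.sum()), 1e-3)

# --- Any metric follows from the posterior, e.g. each classifier's accuracy ---
pred = (logits > 0).astype(int)
acc_ssme = np.where(pred == 1, r[:, None], 1 - r[:, None]).mean(0)
acc_true = (pred == y[:, None]).mean(0)
print("SSME accuracy estimates:", np.round(acc_ssme, 3))
print("true accuracies:       ", np.round(acc_true, 3))
```

The labeled points anchor the component identities (avoiding label switching), while the unlabeled points and the agreement pattern across the k classifiers sharpen the posterior, which is the three-way information sharing the abstract describes.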
Supplementary Material: zip
Primary Area: Evaluation (e.g., methodology, meta studies, replicability and validity, human-in-the-loop)
Submission Number: 22739