Keywords: Auto-labeling, Labeled data curation, Calibration, Self-training, Active learning, Self-supervised representations, Foundation models.
Abstract: Auto-labeling techniques produce labeled data with minimal manual annotation by using representations from self-supervised models together with confidence scores. A popular technique, threshold-based auto-labeling (TBAL), trains a model on these representations and the manual annotations, and assigns the model's prediction as the label to any point where the model's confidence score exceeds a certain threshold. However, the model's scores can be overconfident, leading to poor performance. We show that calibration, a common remedy for overconfidence, falls short in tackling this problem for TBAL. Thus, instead of using existing calibration methods, we introduce a framework for finding optimal confidence functions for TBAL and develop \texttt{Colander}, a method designed to maximize auto-labeling performance. We perform an extensive empirical evaluation of \texttt{Colander} and other confidence functions, using representations from CLIP for image data and from text embedding models for text data. We find that \texttt{Colander} achieves up to a 60\% improvement in coverage (the proportion of points labeled by the model) over the baselines, while maintaining an error level below $5\%$ and using the same amount of labeled data.
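The abstract's description of TBAL can be made concrete with a minimal sketch. The snippet below is illustrative only and is not the paper's implementation: the classifier choice, the `threshold` value, and the function names are assumptions, and the max-softmax score stands in for the confidence function that \texttt{Colander} would replace with a learned one.

```python
# Minimal sketch of threshold-based auto-labeling (TBAL).
# Assumptions: features are precomputed embeddings (e.g., from CLIP or a
# text embedding model); a logistic-regression head is a stand-in for the
# paper's model; names like `tbal_round` are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

def tbal_round(X_labeled, y_labeled, X_unlabeled, threshold=0.95):
    """Train on the manually labeled set, then auto-label every unlabeled
    point whose confidence score exceeds `threshold`.

    Returns the auto-assigned labels and a boolean coverage mask."""
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

    probs = clf.predict_proba(X_unlabeled)   # per-class softmax scores
    confidence = probs.max(axis=1)           # confidence function g(x)
    predictions = probs.argmax(axis=1)

    covered = confidence >= threshold        # points the model gets to label
    return predictions[covered], covered
```

Here `covered.mean()` is the coverage reported in the abstract, and the auto-labeling error is the fraction of the returned labels that disagree with the (unobserved) true labels; overconfident scores inflate coverage at the cost of that error, which is the failure mode the paper targets.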
Submission Number: 64