Keywords: Auto-labeling, Labeled data curation, Calibration, Self-training, Active learning, Self-supervised representations, Foundation models.
Abstract: Auto-labeling techniques produce labeled data with minimal manual annotation by using representations from self-supervised models together with confidence scores. A popular technique, threshold-based auto-labeling (TBAL), trains a model on these representations and the manual annotations, and assigns the model's prediction as the label to any point where the model's confidence score exceeds a certain threshold. However, the model's scores can be overconfident, leading to poor performance. We show that calibration, a common remedy for overconfidence, falls short in tackling this problem for TBAL. Thus, instead of using existing calibration methods, we introduce a framework for finding optimal confidence functions for TBAL and develop \texttt{Colander}, a method designed to maximize auto-labeling performance. We perform an extensive empirical evaluation of \texttt{Colander} and other confidence functions, using representations from CLIP for image data and from text embedding models for text data. We find that \texttt{Colander} achieves up to a 60\% improvement in coverage (the proportion of points labeled by the model) over the baselines, while maintaining an error level below $5\%$ and using the same amount of labeled data.
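The abstract's description of TBAL can be made concrete with a minimal sketch. The snippet below is illustrative only and is not the paper's implementation: the classifier choice, the `threshold` value, and the function names are assumptions, and the max-softmax score stands in for the confidence function that \texttt{Colander} would replace with a learned one.

```python
# Minimal sketch of threshold-based auto-labeling (TBAL).
# Assumptions: features are precomputed embeddings (e.g., from CLIP or a
# text embedding model); a logistic-regression head is a stand-in for the
# paper's model; names like `tbal_round` are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

def tbal_round(X_labeled, y_labeled, X_unlabeled, threshold=0.95):
    """Train on the manually labeled set, then auto-label every unlabeled
    point whose confidence score exceeds `threshold`.

    Returns the auto-assigned labels and a boolean coverage mask."""
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

    probs = clf.predict_proba(X_unlabeled)   # per-class softmax scores
    confidence = probs.max(axis=1)           # confidence function g(x)
    predictions = probs.argmax(axis=1)

    covered = confidence >= threshold        # points the model gets to label
    return predictions[covered], covered
```

Here `covered.mean()` is the coverage reported in the abstract, and the auto-labeling error is the fraction of the returned labels that disagree with the (unobserved) true labels; overconfident scores inflate coverage at the cost of that error, which is the failure mode the paper targets.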
Submission Number: 64