Can semi-supervised learning use all the data effectively? A lower bound perspective

Alexandru Tifrea; Gizem Yüce; Amartya Sanyal; Fanny Yang

Can semi-supervised learning use all the data effectively? A lower bound perspective

Alexandru Tifrea, Gizem Yüce, Amartya Sanyal, Fanny Yang

Published: 21 Sept 2023, Last Modified: 02 Nov 2023NeurIPS 2023 spotlightEveryoneRevisionsBibTeX

Keywords: semi-supervised learning, statistical lower bound

TL;DR: Adaptive statistical lower bound for semi-supervised learning linear classification in 2-GMM data model.

Abstract: Prior theoretical and empirical works have established that semi-supervised learning algorithms can leverage the unlabeled data to improve over the labeled sample complexity of supervised learning (SL) algorithms. However, existing theoretical work focuses on regimes where the unlabeled data is sufficient to learn a good decision boundary using unsupervised learning (UL) alone. This begs the question: Can SSL algorithms simultaneously improve upon both UL and SL? To this end, we derive a tight lower bound for 2-Gaussian mixture models that explicitly depends on the labeled and the unlabeled dataset size as well as the signal-to-noise ratio of the mixture distribution. Surprisingly, our result implies that no SSL algorithm improves upon the minimax-optimal statistical error rates of SL or UL algorithms for these distributions. Nevertheless, in our real-world experiments, SSL algorithms can often outperform UL and SL algorithms. In summary, our work suggests that while it is possible to prove the performance gains of SSL algorithms, this would require careful tracking of constants in the theoretical analysis.

Supplementary Material: pdf

Submission Number: 13797

Loading