Towards a Theoretical Understanding of Semi-Supervised Learning Under Class Distribution Mismatch

Pan Du

Published: 31 May 2025, Last Modified: 13 Nov 2025TPAMIEveryoneCC BY 4.0

Abstract: Semi-supervised learning (SSL) confronts a formidable challenge under class distribution mismatch, wherein unlabeled data contain numerous categories absent in the labeled dataset. Traditional SSL methods undergo performance deterioration in such mismatch scenarios due to the invasion of those instances from unknown categories. Despite some technical efforts to enhance SSL by mitigating the invasion, the profound theoretical analysis of SSL under class distribution mismatch is still under study. Accordingly, in this work, we propose Bi-Objective Optimization Mechanism (BOOM) to theoretically analyze the excess risk between the empirical optimal solution and the population-level optimal solution. Specifically, BOOM reveals that the SSL error is the essential contributor behind excess risk, resulting from both the pseudo-labeling error and invasion error. Meanwhile, BOOM unveils that the optimization objectives of SSL under mismatch are binary: high-quality pseudo-labels and adaptive weights on the unlabeled instances, which contribute to alleviating the pseudo-labeling error and the invasion error, respectively. Moreover, BOOM explicitly discovers the fundamental factors crucial for optimizing the bi-objectives, guided by which an approach is then proposed as a strong baseline for SSL under mismatch. Extensive experiments on benchmark and real datasets confirm the effectiveness of our proposed algorithm.