Open-Domain Semi-Supervised Learning via Glocal Cluster Structure Exploitation

Published: 2024 · Last Modified: 25 Jan 2026 · IEEE Trans. Knowl. Data Eng. 2024 · CC BY-SA 4.0
Abstract: Semi-supervised learning (SSL) aims to reduce the heavy reliance of current deep models on costly manual annotation by leveraging a large amount of unlabeled data in combination with a much smaller set of labeled data. However, most existing SSL methods assume that all labeled and unlabeled data are drawn from the same feature distribution, which can be impractical in real-world applications. In this study, we take a first step toward systematically investigating the open-domain semi-supervised learning setting, where a feature distribution mismatch exists between labeled and unlabeled data. In pursuit of an effective solution for open-domain SSL, we propose a novel framework called GlocalMatch, which exploits both the global and local (i.e., glocal) cluster structure of open-domain unlabeled data. The glocal cluster structure is utilized in two complementary ways. First, GlocalMatch optimizes a Glocal Cluster Compacting (GCC) objective that encourages feature representations of the same class, whether within the same domain or across different domains, to move closer to each other. Second, GlocalMatch incorporates a Glocal Semantic Aggregation (GSA) strategy that produces more reliable pseudo-labels by aggregating predictions from neighboring clusters. Extensive experiments demonstrate that GlocalMatch significantly outperforms state-of-the-art SSL methods, achieving superior performance in both in-domain and out-of-domain generalization.
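To make the Glocal Semantic Aggregation (GSA) idea concrete, the sketch below illustrates one plausible reading of it: a sample's model prediction is blended with a similarity-weighted vote over its nearest cluster prototypes, and the result is accepted as a pseudo-label only above a confidence threshold. This is a hypothetical illustration, not the paper's implementation; the function name, the 50/50 blending weight, the number of neighbors `k`, and the threshold `tau` are all assumptions.

```python
import numpy as np

def gsa_pseudo_label(feature, prototypes, proto_classes, model_probs,
                     num_classes, k=3, tau=0.95):
    """Hypothetical sketch of GSA-style pseudo-labeling.

    feature:       (D,) feature vector of one unlabeled sample
    prototypes:    (M, D) cluster centers across domains
    proto_classes: length-M class id of each prototype
    model_probs:   (num_classes,) classifier softmax output
    """
    # Cosine similarity between the sample feature and every prototype.
    f = feature / np.linalg.norm(feature)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = p @ f

    # Take the k most similar prototypes ("neighboring clusters").
    nearest = np.argsort(sims)[-k:]

    # Aggregate a soft class vote from the neighbors, weighted by similarity.
    agg = np.zeros(num_classes)
    for i in nearest:
        agg[proto_classes[i]] += sims[i]
    agg /= agg.sum()

    # Blend the model prediction with the neighbor vote (weight is assumed).
    probs = 0.5 * model_probs + 0.5 * agg
    label = int(np.argmax(probs))

    # Accept as a pseudo-label only when sufficiently confident.
    return (label, probs[label]) if probs[label] >= tau else (None, probs[label])
```

In this reading, the neighbor vote pulls the pseudo-label toward the glocal cluster structure, so a sample whose classifier output disagrees with all of its nearby clusters is left unlabeled rather than trained on a noisy target.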