Abstract: Deep semi-supervised learning (SSL) reduces the expensive labeling costs that confine deep learning to the lab, making it practical for real-world commercial applications. Today, deep SSL is widely applied in commercial artificial intelligence technologies. In practice, however, the distributions of the labeled and unlabeled datasets may not match, which is a key issue that degrades deep SSL performance. Some recent studies deal with out-of-distribution (OOD) data by removing it outright or by uniformly reducing its weight, which ignores the potential value of OOD data. To address this issue, we propose ITSMatch, a simple, safe, and effective SSL method for text classification that recycles OOD data near the labeled domain to make fuller use of the available data. Specifically, weighted adversarial domain adaptation is applied to OOD data to project it into the space of labeled and in-distribution (ID) data, and its recoverability is quantified by the transferable score. ITSMatch unifies mainstream methods, including pseudo-label generation and consistency regularization between unlabeled data and its augmented versions. In addition, we perform metric learning on labeled data and pseudo-labeled ID data to better capture the features of the sample space. Experimental results on the AG News and Yelp datasets show that ITSMatch outperforms baseline methods including TextSMatch, MixText, UDA, and BERT. This semi-supervised text classification method can be applied to the analysis of product reviews on e-commerce platforms to improve customers’ online shopping experience.
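
As a rough illustration of how pseudo-label generation and consistency regularization can be combined with a per-sample weight, the following minimal sketch may help. It is not the ITSMatch implementation: the function names, the confidence threshold of 0.9, and the use of a weight in [0, 1] as a stand-in for the transferable score are all assumptions made for this example, and the way ITSMatch actually computes and applies the score is not specified here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(probs, threshold):
    """Return the argmax class if its probability reaches the threshold, else None."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] >= threshold else None


def weighted_consistency_loss(weak_logits, strong_logits, weights, threshold=0.9):
    """Mean weighted cross-entropy between pseudo-labels (taken from weakly
    augmented inputs) and predictions on strongly augmented inputs.

    `weights` is a hypothetical per-sample score in [0, 1]: 1.0 for samples
    treated as in-distribution, smaller values for recycled OOD samples.
    """
    if not weights:
        return 0.0
    total = 0.0
    for weak, strong, w in zip(weak_logits, strong_logits, weights):
        label = pseudo_label(softmax(weak), threshold)
        if label is None:
            continue  # low-confidence samples contribute nothing
        total += -w * math.log(softmax(strong)[label])
    return total / len(weights)


# Toy batch of three unlabeled samples with two classes (made-up logits).
weak = [[5.0, 0.0], [0.1, 0.0], [0.0, 6.0]]
strong = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weights = [1.0, 1.0, 0.5]  # third sample plays the role of a down-weighted OOD sample
loss = weighted_consistency_loss(weak, strong, weights)
```

In this toy batch the second sample is skipped because its weak prediction falls below the threshold, and the third contributes half as much as it would with full weight. Averaging over the whole batch rather than only the confident samples is one common design choice, not necessarily the one used by the method described above.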