Self-Paced Pairwise Representation Learning for Semi-Supervised Text Classification

Published: 23 Jan 2024, Last Modified: 23 May 2024, TheWebConf 2024
Keywords: Text Classification, Semi-supervised Learning, Self-paced Learning
TL;DR: This paper integrates pairwise representation learning and self-paced text filtering to tackle overfitting and mislabeling problems in semi-supervised text classification.
Abstract: Text classification is a vital tool for web content mining. Modern deep learning approaches rely heavily on ample annotated data, which often comes at considerable cost. Semi-supervised text classification (SSTC) alleviates the annotation burden by training effective classifiers on a limited number of labeled texts alongside a vast pool of unlabeled texts. While existing SSTC methods have shown effectiveness by training a classifier on labeled texts and boosting the model with pseudo-labeled data derived from unlabeled texts, two challenges remain unsolved: the overfitting problem caused by the limited availability of labeled data during training, and the mislabeling problem stemming from an unreliable pseudo-labeling process. To address these issues, this paper proposes a Self-Paced PairWise representation learning (SPPW) model. Concretely, SPPW alleviates the overfitting problem by replacing the overfitting-prone learning of a parameterized classifier with representation learning in a pairwise manner. Moreover, our findings highlight the potential of text hardness as a criterion complementary to existing confidence-based methods for filtering out unreliable texts. With this insight, we propose a novel self-paced text filtering method that integrates both label confidence and text hardness to reduce mislabeled texts synergistically. Extensive experiments on 3 benchmark SSTC datasets show that SPPW outperforms baselines and is effective in mitigating the overfitting and mislabeling problems.
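The abstract describes filtering pseudo-labeled texts by combining label confidence with text hardness under a self-paced schedule, but does not give the exact criteria. The sketch below is a minimal illustration under assumed definitions: confidence as the maximum predicted class probability, hardness as prediction entropy, and a quantile-based hardness cutoff that a training loop could relax over epochs. The function name and both criteria are hypothetical, not the paper's actual formulation.

```python
import numpy as np

def self_paced_filter(probs, conf_threshold, hardness_quantile):
    """Select pseudo-labeled texts that are both confident and easy.

    probs             : (n_texts, n_classes) predicted class probabilities
    conf_threshold    : minimum label confidence to keep a text
    hardness_quantile : fraction of the batch treated as "easy enough";
                        raising it over epochs gives the self-paced schedule

    NOTE: confidence = max probability and hardness = entropy are
    assumptions for illustration; the paper's definitions may differ.
    """
    # Label confidence: maximum predicted class probability per text.
    confidence = probs.max(axis=1)
    # Assumed text hardness: prediction entropy (higher = harder).
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    # Per-batch hardness cutoff at the chosen quantile.
    easy_cutoff = np.quantile(entropy, hardness_quantile)
    # Keep texts passing both the confidence and the hardness criteria.
    keep = (confidence >= conf_threshold) & (entropy <= easy_cutoff)
    return np.flatnonzero(keep)
```

In this toy setup, a text predicted at (0.5, 0.5) is both unconfident and high-entropy, so it is filtered out, while sharply peaked predictions survive; the two criteria act synergistically in that a text must pass both to be kept.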
Track: Web Mining and Content Analysis
Submission Guidelines Scope: Yes
Submission Guidelines Blind: Yes
Submission Guidelines Format: Yes
Submission Guidelines Limit: Yes
Submission Guidelines Authorship: Yes
Student Author: No
Submission Number: 2100