IOCC: Aligning Semantic and Cluster Centers for Few-shot Short Text Clustering

ACL ARR 2025 May Submission3366 Authors

19 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: In clustering tasks, it is essential to structure the feature space into clear, well-separated distributions. However, because short text representations have limited expressiveness, conventional methods struggle to identify cluster centers that truly capture each category’s underlying semantics, causing the representations to be optimized in suboptimal directions. To address this issue, we propose __IOCC__, a novel few-shot contrastive learning method that aligns the cluster centers with the semantic centers. IOCC consists of two key modules: Interaction-enhanced Optimal Transport (__IEOT__) and Center-aware Contrastive Learning (__CACL__). Specifically, IEOT incorporates semantic interactions between individual samples into the conventional optimal transport problem and generates pseudo-labels. Based on these pseudo-labels, we aggregate high-confidence samples to construct _pseudo-centers_ that approximate the semantic centers. Next, CACL optimizes text representations toward their corresponding _pseudo-centers_. As training progresses, the collaboration between the two modules gradually reduces the gap between cluster centers and semantic centers. As a result, the model learns a high-quality distribution, which improves clustering performance. Extensive experiments on eight benchmark datasets show that IOCC outperforms previous methods, achieving up to __7.34\%__ improvement on the challenging Biomedical dataset while also excelling in clustering stability and efficiency. The code is available at: https://anonymous.4open.science/r/IOCC-C438.
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: few-shot learning, representation learning
Languages Studied: English
Keywords: few-shot learning, representation learning, text clustering
Submission Number: 3366
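
The pipeline described in the abstract (optimal-transport pseudo-labels, pseudo-centers built from high-confidence samples, and a contrastive pull toward those centers) can be illustrated with a rough NumPy sketch. This is not the authors' implementation: it uses plain entropic optimal transport (Sinkhorn iterations with uniform marginals) in place of IEOT, whose sample-interaction term is not specified in the abstract, and it assumes a cross-entropy-over-centers form for the center-aware loss. The function names, the confidence threshold, and the temperature values are all hypothetical.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length (guarding against zero rows).
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)

def sinkhorn_assign(scores, eps=0.1, n_iters=50):
    # Plain entropic OT with uniform marginals (assumption: stands in
    # for IEOT, whose interaction term is not reproduced here).
    n, k = scores.shape
    Q = np.exp((scores - scores.max()) / eps)
    Q /= Q.sum()
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True) * k  # each cluster gets mass 1/k
        Q /= Q.sum(axis=1, keepdims=True) * n  # each sample gets mass 1/n
    return Q * n  # rows sum to 1: soft pseudo-labels

def pseudo_centers(z, Q, old_centers, conf_threshold=0.9):
    # Average the embeddings of high-confidence samples per cluster;
    # keep the previous center if no sample passes the threshold.
    labels = Q.argmax(axis=1)
    conf = Q.max(axis=1)
    centers = old_centers.copy()
    for c in range(Q.shape[1]):
        mask = (labels == c) & (conf >= conf_threshold)
        if mask.any():
            centers[c] = z[mask].mean(axis=0)
    return l2_normalize(centers), labels

def center_contrastive_loss(z, centers, labels, tau=0.5):
    # Assumed loss form: each sample's own pseudo-center is the
    # positive and the other centers are negatives.
    logits = z @ centers.T / tau
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(z)), labels].mean()
```

In a training loop, one would presumably recompute the assignments and pseudo-centers periodically from the current embeddings and minimize the loss with respect to the encoder parameters. The sketch only evaluates the loss and does not perform any gradient update.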