Context-Aware Clustering using Large Language Models

TMLR Paper5457 Authors

24 Jul 2025 (modified: 04 Aug 2025) · Under review for TMLR · CC BY 4.0
Abstract: Despite the remarkable success of Large Language Models (LLMs) in text understanding and generation, their potential for text clustering tasks remains underexplored. While we observe that powerful closed-source LLMs can generate high-quality text clusterings, their massive size and inference cost make them impractical for repeated online use in real-world applications. Motivated by this limitation, we study the transfer of clustering knowledge from LLMs to smaller, more efficient open-source language models (SLMs), aiming to retain performance while improving scalability. We propose CACTUS (Context-Aware ClusTering with aUgmented triplet losS), a systematic approach that leverages SLMs for efficient and effective supervised clustering of entity subsets, particularly focusing on text-based entities. Existing text clustering methods fail to capture the context provided by the entity subset: they typically embed each entity independently, ignoring the mutual relationships among entities within the same subset. CACTUS incorporates a scalable inter-entity attention mechanism that efficiently models pairwise interactions to capture this context. Although several language modeling-based approaches exist for clustering, very few are designed for the task of supervised clustering. We propose a new augmented triplet loss function tailored for supervised clustering, which addresses the inherent challenges of directly applying the standard triplet loss to this problem by introducing a neutral similarity anchor. Furthermore, we introduce a self-supervised clustering pretraining task based on text augmentation techniques to improve the generalization of our model. Extensive experiments on various e-commerce query and product clustering datasets demonstrate that our proposed approach significantly outperforms existing unsupervised and supervised baselines across multiple external clustering evaluation metrics. Our results establish CACTUS as a scalable, generalizable solution for real-world clustering scenarios. Our code is publicly available at https://anonymous.4open.science/r/context-aware-clustering-E90C.
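To illustrate the idea of augmenting a triplet loss with a neutral similarity anchor, the following is a minimal, hypothetical sketch in PyTorch. It is not the paper's exact formulation: the cosine-similarity parameterization, the margin, the neutral threshold tau, and the way the three terms are combined are all assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F


def augmented_triplet_loss(anchor, positive, negative, margin=0.5, tau=0.0):
    """Hypothetical sketch: a standard triplet (ranking) term plus
    neutral-anchor terms that push positive-pair similarities above a
    threshold tau and negative-pair similarities below it."""
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1)
    sim_neg = F.cosine_similarity(anchor, negative, dim=-1)

    # Standard triplet term: positives should be more similar than negatives.
    ranking = F.relu(sim_neg - sim_pos + margin)

    # Neutral-anchor terms: absolute constraints relative to tau.
    pos_term = F.relu(tau - sim_pos)   # positive pairs should exceed tau
    neg_term = F.relu(sim_neg - tau)   # negative pairs should fall below tau

    return (ranking + pos_term + neg_term).mean()


# Usage with random embeddings (batch of 8, dimension 128).
a, p, n = (torch.randn(8, 128) for _ in range(3))
loss = augmented_triplet_loss(a, p, n)
```

The intuition behind such a neutral anchor is that, unlike metric learning for retrieval, supervised clustering needs an absolute notion of "same cluster" versus "different cluster", not only a relative ranking of pairs; whether CACTUS realizes this with per-pair constraints as above or in another form is specified in the full paper.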
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Yen-Chang_Hsu1
Submission Number: 5457