ClusterLLM: Large Language Models as a Guide for Text Clustering

Published: 07 Oct 2023, Last Modified: 01 Dec 2023 · EMNLP 2023 Main
Submission Type: Regular Long Paper
Submission Track: Information Retrieval and Text Mining
Keywords: text clustering, large language model, sentence relation, entropy-based sampling
TL;DR: We utilize sentence relations to query instruction-tuned large language models, and then leverage the feedback to improve clustering quality and determine cluster granularity.
Abstract: We introduce ClusterLLM, a novel text clustering framework that leverages feedback from an instruction-tuned large language model, such as ChatGPT. Compared with traditional unsupervised methods that build upon "small" embedders, ClusterLLM exhibits two intriguing advantages: (1) it enjoys the emergent capabilities of the LLM even when its embeddings are inaccessible; and (2) it understands the user's preference on clustering through textual instructions and/or a few annotated data points. First, we prompt ChatGPT for insights on clustering perspective by constructing hard triplet questions $<$does A better correspond to B than C$>$, where A, B and C are similar data points that belong to different clusters according to a small embedder. We empirically show that this strategy is both effective for fine-tuning the small embedder and cost-efficient for querying ChatGPT. Second, we prompt ChatGPT for help on clustering granularity with carefully designed pairwise questions $<$do A and B belong to the same category$>$, and tune the granularity from cluster hierarchies to be the most consistent with the ChatGPT answers. Extensive experiments on $14$ datasets show that ClusterLLM consistently improves clustering quality, at an average cost of $\sim$\$0.6 per dataset.
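The abstract describes two kinds of LLM queries: hard triplet questions for refining the small embedder, and pairwise questions for choosing cluster granularity. The sketch below is not the authors' code; it only illustrates, under assumed prompt wording, how such queries might be phrased. The function `query_llm` is a hypothetical placeholder for any instruction-tuned LLM call.

```python
# Minimal sketch of the two prompt types described in the abstract.
# All prompt wording is illustrative, not taken from the paper.

def triplet_prompt(anchor: str, choice_b: str, choice_c: str, task: str) -> str:
    # Hard triplet question: does A better correspond to B than to C?
    # A, B, C are similar data points drawn from different clusters
    # according to the small embedder.
    return (
        f"Task: {task}\n"
        f"Anchor: {anchor}\n"
        f"Choice 1: {choice_b}\n"
        f"Choice 2: {choice_c}\n"
        "Which choice corresponds better to the anchor? "
        "Answer with 'Choice 1' or 'Choice 2'."
    )

def pairwise_prompt(text_a: str, text_b: str, task: str) -> str:
    # Pairwise question: do A and B belong to the same category?
    return (
        f"Task: {task}\n"
        f"Text 1: {text_a}\n"
        f"Text 2: {text_b}\n"
        "Do the two texts belong to the same category? Answer 'Yes' or 'No'."
    )

# In the framework as summarized above, triplet answers would supply
# supervision for fine-tuning the small embedder, while pairwise answers
# would be compared against merges in a cluster hierarchy to pick the
# granularity most consistent with the LLM's feedback.
```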
Submission Number: 3951