k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering

Jairo Diaz Rodriguez

Published: 01 Jan 2025, Last Modified: 13 May 2025, CoRR 2025, CC BY-SA 4.0

Abstract: We introduce k-LLMmeans, a novel modification of the k-means algorithm for text clustering that leverages LLM-generated summaries as cluster centroids, capturing semantic nuances often missed by purely numerical averages. This design preserves the core optimization properties of k-means while enhancing semantic interpretability and avoiding the scalability and instability issues typical of modern LLM-based clustering. Unlike existing methods, our approach does not increase LLM usage with dataset size and produces transparent intermediate outputs. We further extend it with a mini-batch variant for efficient, real-time clustering of streaming text. Extensive experiments across multiple datasets, embeddings, and LLMs show that k-LLMmeans consistently outperforms k-means and other traditional baselines and achieves results comparable to state-of-the-art LLM-based clustering, with a fraction of the LLM calls. Finally, we present a case study on sequential text streams and introduce a new benchmark dataset constructed from StackExchange to evaluate text-stream clustering methods.
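To make the central idea concrete, the following is a minimal sketch of a k-means loop in which each centroid is the embedding of a text summary of its cluster rather than a numerical mean. It is based only on the description in the abstract and is not the author's implementation. The function name `k_llm_means`, the callables `embed` and `summarize`, the random initialisation, and the `sample_size` cap are all assumptions made for illustration. Summarising a bounded sample per cluster is one way to keep the number of LLM calls independent of dataset size, as the abstract describes. The toy stand-ins at the bottom replace a real embedding model and a real LLM so the sketch runs without external services.

```python
import numpy as np

def k_llm_means(texts, k, embed, summarize, n_iter=5, sample_size=10, seed=0):
    """Sketch: k-means where each centroid is the embedding of a cluster summary.

    embed(list[str]) -> np.ndarray of shape (n, d)   # hypothetical embedding callable
    summarize(list[str]) -> str                      # hypothetical LLM wrapper
    """
    rng = np.random.default_rng(seed)
    X = embed(texts)

    def assign(centroids):
        # Assignment step: nearest centroid by squared Euclidean distance.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    # Assumption: initialise centroids from k distinct documents chosen at random.
    centroids = X[rng.choice(len(texts), size=k, replace=False)].copy()
    summaries = [""] * k
    for _ in range(n_iter):
        labels = assign(centroids)
        # Update step: one summarize call per cluster on a bounded sample,
        # so LLM usage per iteration is k calls regardless of len(texts).
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue  # keep the previous centroid for an empty cluster
            picked = rng.choice(members, size=min(sample_size, members.size), replace=False)
            summaries[j] = summarize([texts[i] for i in picked])
            centroids[j] = embed([summaries[j]])[0]
    # Final assignment so the returned labels match the returned summaries.
    return assign(centroids), summaries

# Toy stand-ins (assumptions): a bag-of-words "embedding" over a tiny fixed
# vocabulary and a "summarizer" that returns the most frequent word.
VOCAB = ("cat", "dog", "python", "java")

def toy_embed(docs):
    return np.array([[d.lower().split().count(w) for w in VOCAB] for d in docs], dtype=float)

def toy_summarize(docs):
    words = " ".join(docs).lower().split()
    return max(sorted(set(words)), key=words.count)  # sorted() makes ties deterministic

texts = ["cat cat dog", "dog cat", "python java", "java python python"]
labels, summaries = k_llm_means(texts, k=2, embed=toy_embed, summarize=toy_summarize)
```

In this sketch the `summaries` list is the transparent intermediate output: each cluster carries a readable description at every iteration. A real setup would swap `toy_embed` and `toy_summarize` for an embedding model and an LLM prompt of the user's choosing.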
