Coherence based Document Clustering

Published: 01 Jan 2023, Last Modified: 12 May 2023, ICSC 2023, Readers: Everyone
Abstract: Unsupervised document clustering and subsequent topic extraction have gained increasing importance with the ever-growing availability of large text corpora. A variety of algorithms achieve meaningful results, either through document clustering with subsequent topic extraction or through classical topic modeling, but most of them share the property that hyperparameter optimization is difficult and is mainly done by maximizing the coherence of the extracted topics via grid search. Models using word-document embeddings can automatically detect the number of latent topics, but tend to have problems with smaller datasets and often require pre-trained embedding layers for successful topic extraction. We leverage widely used coherence scores by integrating them into a novel document-level clustering approach that uses keyword extraction methods for small- to medium-sized datasets. The metric by which most topic extraction methods optimize their hyperparameters is thus optimized during clustering itself, resulting in ultra-coherent clusters. Moreover, unlike traditional methods, the number of extracted topics or clusters does not need to be determined in advance, saving an additional optimization step and a time- and compute-intensive grid search. Additionally, we show that the number of topics is accurately detected automatically.
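
The abstract does not state which coherence score is used, so the following is only a minimal sketch of what a coherence score over extracted keywords can look like. It assumes a UMass-style measure (log-ratio of co-document frequency to document frequency, averaged over keyword pairs); the function name `umass_coherence`, the smoothing value `eps`, and the toy documents are all hypothetical and not taken from the paper.

```python
import math
from itertools import combinations


def umass_coherence(keywords, documents, eps=1.0):
    """UMass-style coherence of a keyword list over tokenized documents.

    Assumption: this is one common coherence measure, not necessarily
    the one used in the paper. Higher values mean the keywords tend to
    occur in the same documents.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score, pairs = 0.0, 0
    for j, i in combinations(range(len(keywords)), 2):  # j < i
        w_j, w_i = keywords[j], keywords[i]
        d_j = doc_freq(w_j)
        if d_j == 0:
            continue  # skip keywords that never occur in the corpus
        score += math.log((doc_freq(w_i, w_j) + eps) / d_j)
        pairs += 1
    return score / pairs if pairs else 0.0


# Toy corpus (hypothetical): two "pet" documents, two "finance" documents.
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["stock", "market"],
    ["stock", "market", "trade"],
]
coherent = umass_coherence(["cat", "dog"], docs)
mixed = umass_coherence(["cat", "stock"], docs)
```

In this toy setting, keywords drawn from one group of documents score higher than keywords mixed across groups, which is the kind of signal a clustering procedure could try to maximize directly instead of tuning it afterwards via grid search.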