Coherence-Based Document Clustering

29 Sept 2021 (modified: 13 Feb 2023) · ICLR 2022 Conference Withdrawn Submission
Keywords: Topic Modeling, LDA, Transformers, Coherence, Document Clustering
Abstract: Latent Dirichlet Allocation and Non-negative Matrix Factorization are two widely used algorithms for extracting latent topics from large text corpora. While these algorithms differ in their modeling approach, they share the drawback that hyperparameter optimization is difficult and is typically carried out by maximizing the coherence scores of the extracted topics via a grid search. Models using word-document embeddings can automatically detect the number of latent topics, but they tend to struggle on smaller datasets and often require pre-trained embedding layers for successful topic extraction. We leverage widely used coherence scores by integrating them into a novel document-level clustering approach based on keyword extraction methods. The metric that most topic extraction methods use to tune their hyperparameters is thus optimized directly during clustering, resulting in highly coherent clusters. Moreover, unlike traditional methods, the number of extracted topics or clusters does not need to be determined in advance, which saves an additional optimization step and a time- and compute-intensive grid search. Additionally, the number of topics is detected much more accurately than by models leveraging word-document embeddings.
One-sentence Summary: This paper presents a document-level clustering algorithm that integrates widely used coherence scores with keyword extraction methods, so that coherence is optimized directly during clustering.
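Since the abstract describes the method only at a high level, the snippet below is a minimal sketch of one plausible realization of "optimizing coherence during clustering": each document starts as its own cluster described by its top keywords, and clusters are merged greedily as long as the corpus-wide coherence of the cluster keyword sets improves. TF-IDF keyword extraction (scikit-learn) and the c_v score from gensim's CoherenceModel are assumptions standing in for the paper's unspecified keyword extractor and coherence metric; the function name cluster_by_coherence, the greedy merge strategy, and the stopping rule are illustrative, not the authors' implementation.

```python
# Hypothetical sketch: coherence-guided agglomerative document clustering.
# TF-IDF keywords + gensim c_v coherence are stand-ins for the paper's components.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_by_coherence(docs, k=10):
    """Greedily merge document clusters while the mean c_v coherence of their
    keyword sets keeps improving (illustrative only)."""
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(docs)
    terms = np.array(vec.get_feature_names_out())
    # Tokenize the reference corpus with the same analyzer TF-IDF used,
    # so every keyword is guaranteed to appear in the dictionary.
    analyzer = vec.build_analyzer()
    texts = [analyzer(d) for d in docs]
    dictionary = Dictionary(texts)

    def top_keywords(row):
        weights = row.toarray().ravel()
        return list(terms[np.argsort(weights)[::-1][:k]])

    def coherence(keyword_sets):
        cm = CoherenceModel(topics=keyword_sets, texts=texts,
                            dictionary=dictionary, coherence="c_v")
        return cm.get_coherence()  # mean c_v over all keyword sets

    clusters = [[i] for i in range(len(docs))]          # each doc is its own cluster
    keywords = [top_keywords(tfidf[i]) for i in range(len(docs))]
    current = coherence(keywords)

    while len(clusters) > 1:
        best = None
        # O(n^2) candidate merges per step: fine for a sketch, slow at scale.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                merged_kw = list(dict.fromkeys(keywords[a] + keywords[b]))[:k]
                trial = [kw for i, kw in enumerate(keywords) if i not in (a, b)]
                trial.append(merged_kw)
                score = coherence(trial)
                if best is None or score > best[0]:
                    best = (score, a, b, merged_kw)
        score, a, b, merged_kw = best
        if score <= current:
            break  # no merge improves coherence -> number of clusters is fixed here
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [clusters[a] + clusters[b]]
        keywords = [kw for i, kw in enumerate(keywords) if i not in (a, b)] + [merged_kw]
        current = score

    return clusters, keywords
```

Because the loop stops at the first merge that fails to raise coherence, the number of clusters falls out of the procedure itself rather than being fixed in advance, which mirrors the claim in the abstract that no grid search over the topic count is needed.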