An Enhanced BERTopic Framework and Algorithm for Improving Topic Coherence and Diversity

Published: 01 Jan 2022, Last Modified: 05 Oct 2025. HPCC/DSS/SmartCity/DependSys 2022. License: CC BY-SA 4.0
Abstract: In this paper, we enhance and customize the existing BERTopic framework to develop and implement an automated pipeline that delivers a more coherent and diverse set of topics, even with a moderately sized dataset. More specifically, the contributions of this work are threefold: (1) we integrate a dynamic, advanced optimizer into the existing BERTopic framework that learns the optimal number of dimensions for different document embeddings; (2) we develop a k-means-based algorithm within the optimizer to support this dimension-embedding learning; and (3) we conduct an extensive experimental study on three distinct types of datasets (DBPedia, AG News, and Reuters) to evaluate the performance of our approach in terms of the topic quality (TQ) score, computed from topic coherence and topic diversity. The results show that our enhanced, automated BERTopic framework with its dimension-embedding learning algorithm improves the TQ score over the existing framework by 4.49% (before removing stop words) and 16.52% (after removing stop words) across all four representative document-embedding approaches (BERTopic's default Sentence Transformer, Google's Universal Sentence Encoder, OpenAI GPT-2, and our own Context-aware Embedding Model) on all three datasets.
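The abstract does not spell out how the k-means-based optimizer selects an embedding dimensionality, so the following is only a minimal sketch of the general idea under stated assumptions: candidate dimensionalities are produced by a PCA-style projection (the paper may instead use UMAP, as stock BERTopic does), each candidate is clustered with k-means, and the dimensionality whose clustering scores best under the silhouette coefficient is kept. The function names (`kmeans`, `silhouette`, `best_dim`), the candidate list, and the use of the silhouette criterion are all illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():          # skip emptied clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def silhouette(X, labels):
    """Mean silhouette coefficient, brute-force pairwise distances."""
    D = np.linalg.norm(X[:, None] - X[None], axis=2)
    n, scores = len(X), []
    for i in range(n):
        same = labels == labels[i]
        a = D[i, same & (np.arange(n) != i)].mean() if same.sum() > 1 else 0.0
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def best_dim(emb, k, candidate_dims):
    """Pick the projection dimensionality whose k-means clustering
    scores highest by silhouette (illustrative criterion)."""
    Xc = emb - emb.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # PCA directions
    return max(candidate_dims,
               key=lambda d: silhouette(Xc @ Vt[:d].T,
                                        kmeans(Xc @ Vt[:d].T, k)))
```

In a real pipeline the chosen dimensionality would then be fed back into BERTopic's reduction step before topic extraction; the brute-force silhouette here is only practical for small document sets.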