Out-of-vocabulary handling and topic quality control strategies in streaming topic models

Published: 01 Jan 2025, Last Modified: 18 May 2025Neurocomputing 2025EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Topic models have become ubiquitous tools for analyzing streaming data. However, existing streaming topic models suffer from several limitations when applied to real-world data streams. This includes the inability to accommodate evolving vocabularies and control topic quality throughout the streaming process. In this paper, we propose a novel streaming topic modeling approach that dynamically adapts to the changing nature of data streams. Our method leverages Byte-Pair Encoding embedding (BPEmb) to resolve the out-of-vocabulary problem that arises with new words in the stream. Additionally, we introduce a topic change variable that provides fine-grained control over topics’ parameter updates and present a preservation approach to retain high-coherence topics at each time step, helping preserve semantic quality. To further enhance model adaptability, our method allows dynamical adjustment of topic space size as needed. To the best of our knowledge, we are the first to address the expansion of vocabulary and maintain topic quality during the streaming process. Extensive experiments show the superior effectiveness of our method.
Loading