Prefix-VAE: Efficient and Consistent Short-Text Topic Modeling with LLMs

ACL ARR 2024 June Submission 4882 Authors

16 Jun 2024 (modified: 10 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Topic models are compelling methods for discovering latent semantics in a document collection. However, they assume that a document contains sufficient word co-occurrence information to be effective. In short texts, co-occurrence information is minimal, which results in feature sparsity in the document representation; consequently, existing topic models, whether probabilistic or neural, mostly struggle to mine patterns from short texts and generate coherent topics. In this paper, we first explore the capability of large language models (LLMs) to expand short texts into longer ones before applying traditional topic modeling. To further improve efficiency and address the semantic inconsistency of LLM-generated texts, we propose using prefix tuning to train a smaller language model coupled with a variational autoencoder for short-text topic modeling. Extensive experiments on multiple real-world datasets under extreme data sparsity show that our models generate high-quality topics that outperform state-of-the-art models.
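The abstract couples a prefix-tuned language model with a variational autoencoder (VAE). As a rough illustration of the VAE component only, the following is a minimal NumPy sketch of a neural topic model's forward pass: a bag-of-words vector is encoded into Gaussian parameters over a latent topic space, sampled via the reparameterization trick, and decoded back into a word distribution. All dimensions, weight initializations, and the single-layer architecture here are illustrative assumptions, not the paper's actual model, and the prefix-tuning component is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only; not from the paper).
vocab_size, hidden, n_topics = 50, 16, 5

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Encoder weights: bag-of-words -> Gaussian parameters over topic space.
W_h = rng.normal(0, 0.1, (vocab_size, hidden))
W_mu = rng.normal(0, 0.1, (hidden, n_topics))
W_lv = rng.normal(0, 0.1, (hidden, n_topics))
# Decoder weights: topic proportions -> word distribution.
W_dec = rng.normal(0, 0.1, (n_topics, vocab_size))

def vae_forward(bow):
    """One forward pass of a toy VAE topic model on a batch of BoW vectors."""
    h = np.tanh(bow @ W_h)
    mu, logvar = h @ W_mu, h @ W_lv
    # Reparameterization trick: z = mu + sigma * eps.
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)
    theta = softmax(z)                      # per-document topic proportions
    recon = softmax(theta @ W_dec)          # reconstructed word probabilities
    # Negative ELBO = -(reconstruction log-likelihood - KL to N(0, I)).
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)
    rec_ll = np.sum(bow * np.log(recon + 1e-10), axis=-1)
    return theta, -(rec_ll - kl)

# Two toy "documents" as word-count vectors.
bow = rng.integers(0, 3, size=(2, vocab_size)).astype(float)
theta, loss = vae_forward(bow)
```

Training would minimize the returned negative ELBO by gradient descent; the paper's contribution is in feeding this kind of model LLM-informed representations of short texts rather than sparse raw counts.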
Paper Type: Long
Research Area: Information Retrieval and Text Mining
Research Area Keywords: Generation, Information Retrieval and Text Mining
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings
Languages Studied: English
Submission Number: 4882