Abstract: Latent Dirichlet Allocation (LDA), the most popular topic model, represents documents as mixtures of topics and topics as mixtures of words. The topic mixtures represent documents well, while the word mixtures extract meaningful topics from a corpus. This structure makes LDA a powerful tool for organizing documents and summarizing corpora. One limitation of LDA is that its performance depends heavily on the priors. Prior work has shown that priors matter in LDA and has proposed methods for learning them, rather than using symmetric priors, to improve modeling. However, LDA's modeling ability does not necessarily align with its performance in document representation and topic extraction. In this paper, we propose a novel prior setting for LDA that improves both document representation and topic extraction. We compare our setting with symmetric priors and with previously proposed priors that enhance modeling ability. Experiments on topic quality show that LDA with our prior setting extracts better topics than LDA with other prior settings. We also compare document representation ability through tasks such as document clustering and document classification; these experiments demonstrate that LDA with our proposed priors represents documents better. Moreover, our analyses reveal that better modeling does not necessarily lead to better performance in document representation and topic extraction.