Building topic mixture language models using the document soft classification notion of topic models

Shuanhu Bai, Cheung-Chi Leung, Chien-Lin Huang, Bin Ma, Haizhou Li

Published: 2010, Last Modified: 15 May 2023, ISCSLP 2010

Abstract: We present a topic mixture language modeling approach that makes use of the soft classification notion of topic models. Given a set of text documents, we first perform document soft classification by applying a topic modeling process such as probabilistic latent semantic analysis (PLSA) or latent Dirichlet allocation (LDA) to the dataset. We then derive topic-specific n-gram counts from the classified texts. Finally, we build topic-specific n-gram language models (LMs) from these counts using a traditional n-gram modeling approach. In decoding, we infer topics from the processing context and use an unsupervised topic adaptation approach to combine the topic-specific models. Experimental results show that the proposed method outperforms state-of-the-art topic-model-based unsupervised adaptation approaches.
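
The pipeline described in the abstract can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the function names (`topic_bigram_counts`, `bigram_lm`, `mixture_prob`), the use of bigrams, the additive smoothing that stands in for the "traditional n-gram modeling approach", and the toy data. The document-topic posteriors `doc_topic` are assumed to come from a separately trained PLSA or LDA model, and the mixture weights passed to `mixture_prob` stand in for the result of topic inference on the decoding context.

```python
from collections import defaultdict

def topic_bigram_counts(docs, doc_topic):
    """Fractional bigram counts per topic.

    Each bigram in document d adds P(t|d) to topic t's count, so a
    document contributes to every topic in proportion to its posterior.
    """
    n_topics = len(doc_topic[0])
    counts = [defaultdict(float) for _ in range(n_topics)]
    for tokens, posterior in zip(docs, doc_topic):
        for h, w in zip(tokens, tokens[1:]):
            for t, p in enumerate(posterior):
                counts[t][(h, w)] += p
    return counts

def bigram_lm(counts, vocab, alpha=1.0):
    """Bigram model with additive smoothing (an assumed smoothing choice)."""
    history_total = defaultdict(float)
    for (h, _), c in counts.items():
        history_total[h] += c
    vocab_size = len(vocab)

    def prob(w, h):
        numerator = counts.get((h, w), 0.0) + alpha
        denominator = history_total.get(h, 0.0) + alpha * vocab_size
        return numerator / denominator

    return prob

def mixture_prob(w, h, models, weights):
    """Linear interpolation of topic-specific models: sum_t weight_t * P_t(w|h)."""
    return sum(lam * model(w, h) for lam, model in zip(weights, models))

# Toy data: two documents and their assumed topic posteriors P(t|d).
docs = [["stocks", "rose", "today"], ["team", "won", "today"]]
doc_topic = [[0.9, 0.1], [0.2, 0.8]]

vocab = sorted({w for d in docs for w in d})
counts = topic_bigram_counts(docs, doc_topic)
models = [bigram_lm(c, vocab) for c in counts]

# Assumed topic weights inferred from the decoding context.
context_weights = [0.7, 0.3]
p = mixture_prob("rose", "stocks", models, context_weights)
```

Because each topic-specific model is a proper distribution over the vocabulary and the weights sum to one, the interpolated model is also a proper distribution. In a real system the fractional counts would more likely be passed to an established n-gram toolkit with a smoothing method that supports non-integer counts.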
