Optimizing Semantic Coherence in Topic Models

David M. Mimno, Hanna M. Wallach, Edmund M. Talley, Miriam Leenders, Andrew McCallum

2011 (modified: 10 Nov 2022)EMNLP 2011Readers: Everyone

Abstract: Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) An analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).

0 Replies