Assessing the Uncertainty of the Text Generating Process Using Topic Models

Published: 01 Jan 2020, Last Modified: 29 Jan 2024, PKDD/ECML Workshops 2020, Readers: Everyone
Abstract: Latent Dirichlet Allocation (LDA) is one of the most popular topic models employed for the analysis of large text data. When applied repeatedly to the same text corpus, LDA leads to different results. To address this issue, several methods have been proposed. In this paper, instead of dealing with this methodological source of algorithmic uncertainty, we assess the aleatoric uncertainty of the text generating process itself. For this task, we use a direct LDA-model approach to quantify the uncertainty due to the random process of text generation and propose three different bootstrap approaches to resample texts. These make it possible to construct uncertainty intervals of topic proportions for single texts as well as for text corpora over time. We discuss the differences between the uncertainty intervals derived from the three bootstrap approaches and from the direct approach, both for single texts and for aggregations of texts. We present the results of applying the proposed methods to an example corpus consisting of all articles published in a German quality daily newspaper over one full year, and we investigate the effect of different sample sizes on the uncertainty intervals.
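
The abstract does not spell out the three bootstrap approaches, so the following is only a rough sketch of the general idea of resampling a text to obtain an uncertainty interval for its topic proportions. It assumes that each word token already has a fixed topic assignment from a fitted LDA model, resamples the tokens with replacement, and takes percentile intervals. The function name `bootstrap_topic_intervals`, its parameters, and the word-level resampling scheme are illustrative assumptions, not the authors' method.

```python
import random
from collections import Counter


def bootstrap_topic_intervals(topic_assignments, num_topics,
                              n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for the topic proportions of one text.

    topic_assignments: list of topic ids, one per word token. These are
    assumed to come from an already fitted LDA model and are held fixed
    (a simplifying assumption made for this sketch).
    Returns a list of (lower, upper) tuples, one per topic.
    """
    n = len(topic_assignments)
    if n == 0:
        raise ValueError("text must contain at least one token")

    rng = random.Random(seed)
    samples = [[] for _ in range(num_topics)]

    for _ in range(n_resamples):
        # Resample the tokens of the text with replacement.
        resample = rng.choices(topic_assignments, k=n)
        counts = Counter(resample)
        for k in range(num_topics):
            samples[k].append(counts.get(k, 0) / n)

    intervals = []
    for k in range(num_topics):
        props = sorted(samples[k])
        # Simple percentile interval from the sorted bootstrap proportions.
        lower = props[int((alpha / 2) * (n_resamples - 1))]
        upper = props[int((1 - alpha / 2) * (n_resamples - 1))]
        intervals.append((lower, upper))
    return intervals


# Hypothetical text with 100 tokens spread over three topics.
assignments = [0] * 60 + [1] * 30 + [2] * 10
for topic, (lower, upper) in enumerate(bootstrap_topic_intervals(assignments, 3)):
    print(f"topic {topic}: [{lower:.2f}, {upper:.2f}]")
```

In a sketch like this, shorter texts give wider intervals because each resample contains fewer tokens, which is one simple way to see why sample size matters for the width of the uncertainty intervals.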