Topic Modeling Applied to Reddit Posts

Maria Kedzierska, Mikolaj Spytek, Marcelina Kurek, Jan Sawicki, Maria Ganzha, Marcin Paprzycki

Published: 2023, Last Modified: 20 May 2025; BDA (Astronomy, Science, and Engineering) 2023; CC BY-SA 4.0

Abstract: Text data is widely used for both commercial and research purposes. While extensive sources of text data are available on Internet forums such as Reddit, their volume is vast and, typically, only a small subset of posts is studied. To overcome the problem of data size, topic modeling can be applied to extract the main ideas from the documents. However, as will be shown, different modeling techniques may produce very different results. Specifically, this contribution provides an overview of the most popular topic models used in natural language processing, and of methods for their comparison. Moreover, a software solution for downloading, modeling, exploring, and comparing topics contained in Reddit posts is introduced. The proposed application is experimentally validated by showing that the extracted topics reflect real-world events. Finally, the obtained results are compared to those originating from a different tool used for investigating topic popularity.