Evaluating Topic Modeling Pre-processing Pipelines for Portuguese Texts

Antônio Pereira De Souza Júnior, Pablo Cecilio, Felipe Viegas, Washington Cunha, Elisa Tuler de Albergaria, Leonardo Chaves Dutra da Rocha

Published: 2022, Last Modified: 21 Jan 2026. Venue: WebMedia 2022. License: CC BY-SA 4.0

Abstract: Topic Modeling (TM) is among the most widely used approaches for extracting and organizing information from large amounts of data. These approaches aim to find semantic topics in textual documents (e.g., product reviews, tweets). Despite their good results on English texts, we do not observe the same semantic quality when they are applied to Portuguese texts, which are more verbose and present varied and complex verb conjugations and many homonyms, among other particularities. This work intends to fill this gap by exploring and evaluating different Topic Modeling pre-processing pipelines for Portuguese texts, that is, sequences of tasks that need to be performed before applying the TM strategies. More specifically, we evaluate different pre-processing pipeline configurations using different semantic data representations to overcome the challenges faced by TM strategies on Portuguese text. In our experimental evaluation, considering two datasets collected from Twitter and Reddit related to Brazilian political discussion, we show that our proposed extended pre-processing pipeline, especially when considering semantic representations, can achieve significant gains in effectiveness compared to the TM approaches originally proposed for English texts (up to 9x better).
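To make the idea of a pre-processing pipeline as a sequence of tasks concrete, the following is a minimal Python sketch. The abstract does not list the individual steps the authors evaluate, so every step here (noise removal, lowercasing, accent stripping, tokenization, stopword removal), the tiny stopword list, and all function names are illustrative assumptions rather than the paper's actual pipeline.

```python
import re
import unicodedata
from typing import List

# Tiny illustrative stopword list (an assumption, not the list used in the paper).
# Entries are stored without accents because accents are stripped before filtering.
STOPWORDS = {"a", "o", "as", "os", "de", "da", "do", "e", "que",
             "em", "para", "um", "uma", "nao"}


def remove_noise(text: str) -> str:
    """Drop URLs, @mentions and #hashtags, which are common in tweets and Reddit posts."""
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"[@#]\w+", " ", text)


def strip_accents(text: str) -> str:
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def tokenize(text: str) -> List[str]:
    """Split lowercase, accent-free text into alphabetic tokens."""
    return re.findall(r"[a-z]+", text)


def remove_stopwords(tokens: List[str]) -> List[str]:
    """Keep tokens that are not stopwords and are longer than one character."""
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]


def preprocess(text: str) -> List[str]:
    """Hypothetical pipeline: each step feeds the next, before any TM strategy runs."""
    text = remove_noise(text)
    text = text.lower()
    text = strip_accents(text)
    return remove_stopwords(tokenize(text))


if __name__ == "__main__":
    sample = "O governo não aprovou a reforma! https://t.co/abc @fulano #política"
    print(preprocess(sample))
```

Keeping each step as a separate function makes it straightforward to enable, disable, or reorder steps, which is the kind of configuration comparison the abstract describes. A pipeline aimed at the verb-conjugation and homonym issues mentioned above would likely also need lemmatization or semantic representations, which this sketch does not attempt.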