A Co-concerned Multilingual Topic Detection Model Based on mT5 and Frequency Entropy

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
Abstract: Topic models play a crucial role in fields such as text classification and semantic extraction. However, improving the quality of topic words remains a persistent, long-studied challenge. In particular, attention bias arising from different language cultures often emerges, especially around hot events: while topic models excel at detecting incident topics, they are susceptible to being misled by this bias. Furthermore, existing topic models face limitations on multilingual corpora, as synonymous representations across languages may disproportionately occupy the front of the output sequence. To address these issues, we propose a model that combines the text clustering of BERTopic with topic-word extraction using a fine-tuned mT5. The output words are filtered against a word table that stores words with high information entropy across multiple languages. Experiments on our dataset demonstrate strong performance both in detecting commonly focused multilingual topics and in eliminating output redundancy.
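The entropy-based word table described in the abstract can be illustrated with a minimal sketch. The paper does not specify its exact formulation, so the following is an assumption: each candidate word's frequency distribution across languages is scored by Shannon entropy, and only words whose attention is spread across languages (high entropy) are kept. All function names and the threshold value here are hypothetical.

```python
import math

def cross_lingual_entropy(counts):
    """Shannon entropy (bits) of a word's frequency distribution across
    languages. counts maps language code -> frequency of the word (or its
    aligned translation). High entropy means comparable attention across
    languages; low entropy means the word is concentrated in one language."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def build_word_table(word_counts, threshold):
    """Keep only words whose cross-lingual frequency entropy meets the
    threshold -- a sketch of the high-entropy word table used to filter
    mT5 topic-word outputs."""
    return {w for w, c in word_counts.items()
            if cross_lingual_entropy(c) >= threshold}

# Toy counts over the five languages studied (values are illustrative)
word_counts = {
    "earthquake": {"en": 40, "zh": 35, "ja": 50, "ko": 30, "ar": 25},  # shared focus
    "local_fair": {"en": 60, "zh": 1, "ja": 0, "ko": 0, "ar": 0},      # single-language
}
table = build_word_table(word_counts, threshold=2.0)
```

Under this sketch, "earthquake" (entropy near log2(5) ≈ 2.32 bits) survives the filter, while the single-language word falls well below the threshold and is dropped.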
Paper Type: long
Research Area: Multilinguality and Language Diversity
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: Chinese, Japanese, Korean, Arabic, English