Improving Cross-Lingual Neural Topic Modeling with Document-Level Prototype-based Contrastive Learning

ACL ARR 2025 May Submission 4976 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Cross-lingual topic modeling (CLTM) is an essential task in data mining and natural language processing that aims to extract aligned, semantically coherent topics from bilingual corpora. Recent cross-lingual neural topic models widely leverage bilingual dictionaries to achieve word-level topic alignment. However, two critical challenges remain: the topic mismatch issue and the degeneration of intra-lingual topic interpretability. Owing to linguistic diversity, some translated word pairs do not represent semantically coherent topics despite being lexical equivalents, and the cross-lingual topic alignment objective in CLTM can consequently degrade topic interpretability within each individual language. To address these issues, we propose a novel document-level prototype-based contrastive learning paradigm for cross-lingual topic modeling. Additionally, we design a retrieval-based positive sampling strategy for contrastive learning that requires no data augmentation. Building on these components, we introduce ProtoXTM, a cross-lingual neural topic model based on document-level prototype-based contrastive learning. Extensive experiments indicate that our approach achieves state-of-the-art performance on cross-lingual and mono-lingual benchmarks, demonstrating enhanced topic interpretability.
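To make the two named components concrete, below is a minimal PyTorch sketch of what a document-level prototype-based contrastive objective with retrieval-based positive sampling might look like. This is an illustrative assumption, not the paper's actual implementation: the function names, the nearest-neighbor positive retrieval, the InfoNCE-over-prototypes loss, and the temperature value are all hypothetical choices for exposition.

```python
import torch
import torch.nn.functional as F

def retrieve_positives(src_emb, tgt_emb):
    # Retrieval-based positive sampling (assumed form): for each source-language
    # document, take its nearest neighbor in the target-language corpus as the
    # positive pair, so no data augmentation is needed.
    sim = F.normalize(src_emb, dim=-1) @ F.normalize(tgt_emb, dim=-1).T
    return sim.argmax(dim=-1)  # index of the nearest target document

def prototype_contrastive_loss(doc_emb, pos_emb, prototypes, temperature=0.1):
    # Document-level prototype-based contrastive objective (assumed form):
    # a document and its retrieved positive should both fall near the same
    # topic prototype and away from the others (InfoNCE over prototypes).
    doc_emb = F.normalize(doc_emb, dim=-1)
    pos_emb = F.normalize(pos_emb, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    logits_doc = doc_emb @ protos.T / temperature  # (batch, n_prototypes)
    logits_pos = pos_emb @ protos.T / temperature  # (batch, n_prototypes)
    # Pseudo-label each pair by the source document's closest prototype.
    labels = logits_doc.argmax(dim=-1).detach()
    return F.cross_entropy(logits_doc, labels) + F.cross_entropy(logits_pos, labels)

# Toy usage with random embeddings: 4 source docs, 6 target docs,
# 3 topic prototypes, embedding dimension 8.
src = torch.randn(4, 8)
tgt = torch.randn(6, 8)
prototypes = torch.randn(3, 8, requires_grad=True)
pos = tgt[retrieve_positives(src, tgt)]
loss = prototype_contrastive_loss(src, pos, prototypes)
loss.backward()
```

The key design point this sketch tries to capture is that alignment is supervised at the document level (retrieved cross-lingual neighbors pulled toward shared topic prototypes) rather than at the word level via bilingual dictionary pairs.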
Paper Type: Long
Research Area: Information Retrieval and Text Mining
Research Area Keywords: Topic modeling, Cross-Lingual NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-compute settings & efficiency
Languages Studied: English, Chinese
Submission Number: 4976