Cross-lingual Text Clustering in a Large System

Nicole R. Schneider, Jagan Sankaranarayanan, Hanan Samet

Published: 2023, Last Modified: 09 Mar 2025, NLPIR 2023, CC BY-SA 4.0

Abstract: The multilingual world needs systems that can cluster text written in multiple languages into the same thread or topic. Multilingual text can be clustered by translating it into a canonical language and then clustering, by using multilingual embeddings to cluster articles in a shared embedding space, or by other language-independent methods. The performance and pitfalls of these methods have not been well studied in the context of real-time clustering across documents written in many languages. We address this problem by generating a large dataset of news articles using a reference architecture that continuously indexed and clustered articles spanning 17 languages over the last 15 years. Through analysis of these documents and their clusters, we show that clustering quality depends on the normalization of proper nouns, the types of georeferences, and the overall geographic focus of the document.