Cross-Lingual Clustering Using Large Language Models

Published: 01 Jan 2024, Last Modified: 09 Mar 2025, Venue: GeoAI@SIGSPATIAL 2024, License: CC BY-SA 4.0
Abstract: Text clustering methods traditionally rely on a shared vocabulary and script, which poses a challenge for cross-lingual text clustering, a problem that arises in domains including social media, news, and finance. Recent approaches to cross-lingual clustering have found success by leveraging the latent embedding spaces of neural models and, more recently, by directly using Large Language Models (LLMs) to cluster text in zero-shot or few-shot settings. However, much of this work focuses on short text, such as social media posts. In this paper, we use cross-lingual clustering in the news domain as a case study to test whether LLMs can effectively cluster long documents by extracting and maintaining keyphrases associated with each cluster of documents. We compare the clusterings that several LLMs produce in a zero-shot setting to those of a more traditional online clustering method that uses TF-IDF to cluster documents based on their content and time of publication. We find that LLMs tend to cluster articles based on the text, and in particular on the language of the text more than its content, while ignoring the time and location of publication. This indicates that further work is needed before LLMs can reliably be used to cluster news articles across multiple languages.
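The abstract does not specify how the TF-IDF baseline works, so the following is only a minimal sketch of what an online clustering method that uses content and publication time could look like. It is not the paper's implementation. The function names, the whitespace tokenizer, the smoothed IDF formula, the similarity threshold, and the time window are all assumptions chosen for illustration, and the sketch is monolingual: it does nothing to bridge vocabularies across languages.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict: term -> weight) per text.

    Assumption: IDF is computed over the whole batch for simplicity;
    a truly streaming system would update document frequencies incrementally.
    """
    tokenized = [text.lower().split() for text in texts]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(texts)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def online_cluster(docs, sim_threshold=0.3, max_gap_days=3):
    """Cluster (day, text) pairs one at a time, in order of publication day.

    A document joins the most similar cluster that is still "open" (its
    latest member is at most max_gap_days old); otherwise it starts a new
    cluster. Returns one cluster label per document, in input order.
    """
    vectors = tfidf_vectors([text for _, text in docs])
    order = sorted(range(len(docs)), key=lambda i: docs[i][0])
    clusters = []  # each entry: {"centroid": Counter, "last_day": int}
    labels = [None] * len(docs)
    for i in order:
        day = docs[i][0]
        best, best_sim = None, sim_threshold
        for cid, cluster in enumerate(clusters):
            if day - cluster["last_day"] > max_gap_days:
                continue  # cluster is too old to accept new documents
            sim = cosine(vectors[i], cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cid, sim
        if best is None:
            clusters.append({"centroid": Counter(), "last_day": day})
            best = len(clusters) - 1
        # The centroid is kept as a running sum; cosine ignores the scale.
        clusters[best]["centroid"].update(vectors[i])
        clusters[best]["last_day"] = day
        labels[i] = best
    return labels


docs = [
    (0, "earthquake strikes coastal city"),
    (1, "coastal city earthquake damage reported"),
    (1, "central bank raises interest rates"),
    (30, "earthquake strikes coastal city again"),
]
labels = online_cluster(docs)
```

In this toy input, the first two articles share a cluster, the finance article starts its own, and the last article starts a new cluster despite its similar wording because it falls outside the assumed three-day window. This illustrates the role publication time plays in such a baseline, which is the signal the abstract reports LLMs tending to ignore.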