Islands, Hubs, and One-Way Streets: A Geometric Autopsy of Multilingual Embedding Space Collapse and Retrieval Asymmetry

ACL ARR 2026 January Submission 9188 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: multilingual embeddings, cross-lingual retrieval, embedding geometry, anisotropy, hubness, language islands, retrieval asymmetry, low-resource languages, semantic alignment, multilingual evaluation
Abstract: The promise of multilingual Large Language Models (LLMs) hinges on their ability to construct a universal semantic space where concepts are represented independently of source language. This study rigorously evaluates this premise across five state-of-the-art embedding models (Google Gemini, Qwen-Qwen3, Mistral-Embed, and OpenAI Text-3) using a novel suite of 12 geometric probes. Our analysis reveals that current architectures have not achieved a true interlingua; instead, they generate distinct "language islands" with linear separability exceeding 99.7%. We expose a critical "Mistral Paradox," where smaller, dense models capture phylogenetic relationships (e.g., Hindi-Bangla affinity) better than larger models yet suffer from severe anisotropy (score: 0.81) and "hubness," resulting in a collapse of retrieval reciprocity to $\sim$13%. Conversely, Gemini (3072d) achieves superior isometric stability (31.4% reciprocity) but fails to capture subtle linguistic family structures. We further identify a "Tokenization Tax," where verbose non-Latin scripts induce significant geometric expansion or collapse depending on the architecture. Our findings suggest that scaling dimensions alone cannot resolve the asymmetric "one-way street" of cross-lingual retrieval, highlighting the need for geometric alignment techniques that enforce isometric consistency.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: probing, robustness, data shortcuts/artifacts
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Data analysis
Languages Studied: English, Bangla, Hindi, Arabic
Submission Number: 9188
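
Illustrative sketch (not from the paper): the abstract reports retrieval reciprocity and an anisotropy score without stating how either is computed. The Python below shows one common convention for each, under two assumptions: reciprocity is taken as the fraction of source items whose nearest target, by cosine similarity, retrieves the same source item back, and anisotropy is taken as the mean cosine similarity over all distinct pairs of embeddings. The function names `cosine`, `nearest`, `reciprocity`, and `anisotropy` are hypothetical, and the authors' probes may be defined differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, pool):
    """Index of the vector in `pool` most similar to `query`."""
    return max(range(len(pool)), key=lambda j: cosine(query, pool[j]))


def reciprocity(src, tgt):
    """Assumed definition: share of source items whose nearest target
    retrieves that same source item in the reverse direction."""
    hits = 0
    for i, s in enumerate(src):
        j = nearest(s, tgt)            # forward retrieval: source -> target
        if nearest(tgt[j], src) == i:  # backward retrieval: target -> source
            hits += 1
    return hits / len(src)


def anisotropy(emb):
    """Assumed definition: mean cosine similarity over distinct pairs.
    Values near 1 mean the vectors crowd into a narrow cone."""
    n = len(emb)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(emb[i], emb[j]) for i, j in pairs) / len(pairs)


# Toy data: both source vectors pick target 0 as nearest (a "hub"),
# but target 0 can point back to only one of them.
src = [[1.0, 0.0], [0.9, 0.1]]
tgt = [[1.0, 0.0], [0.0, 1.0]]
hub_reciprocity = reciprocity(src, tgt)
```

In the toy data, target 0 acts as a hub: it is the nearest neighbour of both source vectors yet can reciprocate only one of them, which is the "one-way street" pattern the abstract describes.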