Topology Matters: How Scale and Alignment Reshape Multilingual Spaces

Published: 22 Sept 2025, Last Modified: 22 Sept 2025 · WiML @ NeurIPS 2025 · CC BY 4.0
Keywords: Multilingual NLP, Representation topology, Alignment objectives, Low-resource languages, Cross-lingual transfer
Abstract: Multilingual pretrained models underpin cross-lingual transfer in NLP, yet how their internal spaces encode language identity remains poorly understood. Large encoders such as XLM-R have been shown to "flatten" linguistic variation, while compact instruction-tuned models like mT0-small emphasize task alignment. A systematic comparison of these families at the level of representation topology is missing. We introduce TopoLingEval, a lightweight framework that probes multilingual geometry through three lenses: (i) PCA and t-SNE projections to reveal global versus local structure, (ii) centroid distance analysis to quantify overlap or separation between languages, and (iii) correlation with typological resources to assess linguistic grounding. Using six diverse languages from TyDiQA, we compare XLM-R, a large multilingual encoder, with mT0-small, a compact instruction-tuned model. Our findings show complementary behaviours. XLM-R captures greater global variance but collapses languages into overlapping regions, obscuring language identity. By contrast, mT0-small produces sharper clusters and larger inter-language distances, but without meaningful correlation to genealogical similarity. This reveals a trade-off between cross-lingual sharing and typological grounding. We argue that future multilingual models must balance inclusivity with linguistic structure and be evaluated not only on accuracy but also on representational fidelity.
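
The sketch below illustrates the three lenses named in the abstract. It is a minimal, hypothetical rendering, not the authors' released code: it assumes per-language sentence embeddings have already been extracted (e.g. mean-pooled hidden states from XLM-R or mT0-small), and all function and variable names are illustrative.

```python
# Minimal sketch of the three TopoLingEval lenses described in the abstract.
# Assumes `embeddings` is an (n_sentences, hidden_dim) array and `lang_labels`
# gives the language of each row; names here are illustrative only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr


def project(embeddings: np.ndarray):
    """Lens (i): global structure via PCA, local structure via t-SNE."""
    pca_2d = PCA(n_components=2).fit_transform(embeddings)
    tsne_2d = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
    return pca_2d, tsne_2d


def centroid_distances(embeddings: np.ndarray, lang_labels: list):
    """Lens (ii): pairwise Euclidean distances between per-language centroids."""
    langs = sorted(set(lang_labels))
    centroids = np.stack([
        embeddings[[i for i, l in enumerate(lang_labels) if l == lang]].mean(axis=0)
        for lang in langs
    ])
    return langs, squareform(pdist(centroids, metric="euclidean"))


def typological_correlation(model_dists: np.ndarray, typo_dists: np.ndarray):
    """Lens (iii): Spearman correlation between model-space distances and
    typological (e.g. genealogical) distances, over the upper triangle."""
    iu = np.triu_indices_from(model_dists, k=1)
    rho, p_value = spearmanr(model_dists[iu], typo_dists[iu])
    return rho, p_value
```

A trade-off of the kind the abstract reports would show up here as large inter-centroid distances from lens (ii) paired with a weak Spearman correlation from lens (iii).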
Submission Number: 238