Track: Main Papers Track (6 to 9 pages)
Keywords: Multilingual representation learning, geometric analysis, representation topology, cross-lingual generalization, diagnostic frameworks, zero-shot transfer
Abstract: How multilingual models encode linguistic diversity remains an open question that standard accuracy metrics cannot answer on their own. This work presents TopoLingEval, a lightweight diagnostic framework that combines geometric projection via principal component analysis (PCA), centroid distance analysis, and typological correlation to analyze the structure of multilingual representation spaces. Applied to a large masked encoder (XLM-R) and a compact instruction-tuned model (mT0-small), the framework reveals that model scale and alignment objectives are associated with distinct topological and behavioral patterns: XLM-R forms a compact, homogeneous representational space associated with stronger zero-shot question answering (QA) transfer, while mT0-small maintains clearer language boundaries but generalizes less well in this setting. In this focused pilot study, more compact multilingual spaces, as measured by lower mean inter-language distance, consistently co-occur with more stable zero-shot transfer, while typological correlation remains low for both models. Overall, TopoLingEval is intended as a reproducible diagnostic tool for examining multilingual geometry and generating hypotheses about its relationship to cross-lingual generalization, with potential implications for the evaluation of low-resource languages.
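The "mean inter-language distance" metric named in the abstract can be sketched as follows. This is a hypothetical illustration, not code from TopoLingEval: the function name, the use of Euclidean distance between centroids, and the toy embeddings are all assumptions for the sake of the example.

```python
# Sketch of a mean inter-language distance metric (assumed formulation):
# average each language's token/sentence embeddings into a centroid, then
# take the mean pairwise Euclidean distance between centroids. A lower
# value indicates a more compact multilingual representation space.
import numpy as np

def mean_inter_language_distance(embeddings_by_lang):
    """embeddings_by_lang: dict mapping language code -> (n_i, d) array."""
    centroids = np.stack([vecs.mean(axis=0)
                          for vecs in embeddings_by_lang.values()])
    # All pairwise centroid-to-centroid Euclidean distances.
    diffs = centroids[:, None, :] - centroids[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Average over the strict upper triangle (each unordered pair once).
    iu = np.triu_indices(len(centroids), k=1)
    return float(dists[iu].mean())

# Toy usage with random stand-in embeddings (not real model outputs).
rng = np.random.default_rng(0)
langs = {
    "en": rng.normal(0.0, 1.0, (10, 4)),
    "de": rng.normal(0.5, 1.0, (10, 4)),
    "sw": rng.normal(2.0, 1.0, (10, 4)),
}
print(mean_inter_language_distance(langs))
```

Under this reading, comparing the metric across models (e.g. XLM-R vs. mT0-small) on matched sentence sets is what would support the compactness claim in the abstract.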
Submission Number: 42