Do Monolingual Language Models Learn Cross-Lingual Universal Conceptual Representations?

Published: 01 Mar 2026, Last Modified: 01 Mar 2026 · UCRL@ICLR2026 Poster · CC BY 4.0
Keywords: Multilingual Concept Representation, Multilingual Language Models, Activation Patching
TL;DR: Monolingual language models develop partially compatible, cross-lingual conceptual representations, and activation patching reveals these representations can causally transfer semantic knowledge across languages.
Abstract: Do language models learn universal, language-agnostic conceptual representations, or merely language-specific features that appear aligned under correlational analysis? We investigate this question using Goldfish, a family of independently trained small monolingual causal language models spanning 350 languages that share architecture and training budgets but differ entirely in data, vocabulary, and parameters, enabling the study of emergent conceptual structure without multilingual supervision. We first evaluate cross-lingual representational alignment using centered kernel alignment (CKA) at the sentence and token levels on semantically matched parallel data, showing that independently trained models exhibit robust alignment beyond architectural baselines, with alignment strength scaling with training data volume and linguistic proximity. We then introduce cross-lingual activation patching as an interventional framework for testing concept validity: hidden representations from a source-language model are injected into a target-language model without any learned projection or alignment. Across controlled case studies and large-scale contrastive evaluations, patched activations systematically steer target predictions in semantically consistent directions, with the strongest causal effects emerging in early and intermediate layers. These results provide evidence that monolingual language models learn partially compatible concept-level representations that support cross-model semantic transfer. They position activation patching as a scalable technique for evaluating learned concepts under causal abstraction, and offer new insight into how universal concepts emerge in language models across the world's languages.
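The centered kernel alignment score mentioned in the abstract has a simple closed form in the linear-kernel case. The sketch below is not the authors' code; it is a minimal illustration, assuming two activation matrices whose rows correspond to the same n semantically matched parallel inputs:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between representation matrices X (n x d1) and
    Y (n x d2), where row i of each matrix is the activation for
    the same (translated) input i in the two monolingual models."""
    # Center every feature dimension so CKA measures shared structure,
    # not shared offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-based formulation for linear kernels:
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))
    return float(numerator / denominator)
```

Linear CKA is invariant to isotropic scaling and orthogonal rotation of either representation, which is why it can compare models with different dimensionalities and no shared vocabulary; scores near 1 indicate closely aligned geometry, scores near 0 indicate unrelated representations.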
Submission Number: 42