Track: tiny / short paper (up to 5 pages)
Domain: machine learning
Abstract: Do language models learn universal, language-agnostic conceptual representations, or language-specific features that only appear aligned under correlational analysis? We investigate this using \textit{Goldfish}, a family of independently trained small monolingual causal language models spanning 350 languages that share architecture and training budgets but differ entirely in data, vocabulary, and parameters, enabling the study of emergent conceptual structure without multilingual supervision. We evaluate cross-lingual representational alignment using centered kernel alignment (CKA) at sentence and token levels on semantically matched parallel data, showing robust alignment beyond architectural baselines that scales with training data and linguistic proximity. We then introduce \textbf{cross-lingual activation patching}, an interventional framework that injects hidden representations from a source-language model into a target-language model without learned projection. Across controlled case studies and contrastive evaluations, patched activations steer target predictions in semantically consistent directions, with strongest causal effects in early and intermediate layers. These results suggest that monolingual language models learn partially compatible concept-level representations that enable cross-model semantic transfer, positioning activation patching as a scalable method for evaluating causal conceptual alignment across languages.
Presenter: ~Suchir_Salhan1
Submission Number: 81
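Note on the alignment metric: the abstract names centered kernel alignment (CKA) but does not say which variant is used. The sketch below assumes the linear form, CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F) on column-centered activations, and uses random matrices as stand-ins for hidden states of two models on the same parallel sentences. The function name `linear_cka` and all array shapes are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices with matched rows.

    x has shape (n_examples, d_x) and y has shape (n_examples, d_y); the
    feature widths may differ, as they would for independently trained models.
    """
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Hypothetical stand-ins for sentence-level activations (not real model outputs).
rng = np.random.default_rng(0)
src = rng.normal(size=(200, 16))

# An orthogonal change of basis leaves the representational geometry intact.
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
rotated = src @ rotation

# Independent noise with a different width plays the role of an unaligned model.
unrelated = rng.normal(size=(200, 24))

print("same geometry, rotated basis:", linear_cka(src, rotated))
print("unrelated activations:", linear_cka(src, unrelated))
```

Because linear CKA is invariant to orthogonal transformations and isotropic scaling, it can compare models that share no parameters or vocabulary, which is the setting the abstract describes. The nonzero score for unrelated random activations illustrates why alignment needs to be read against a baseline.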