What Language Do Non-English-Centric Large Language Models Think in?

ACL ARR 2025 February Submission3798 Authors

15 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0

Abstract: In this study, we investigate whether non-English-centric large language models "think" in their specialized language. Specifically, we analyze which languages the intermediate-layer representations favor during generation when projected into the vocabulary space; we term these latent languages. We categorize non-English-centric models into two groups: CPMs, which are English-centric models with continued pre-training on their specialized language, and BLMs, which are pre-trained from scratch on a balanced mix of multiple languages. Our findings reveal that while English-centric models rely exclusively on English as their latent language, non-English-centric models activate multiple latent languages, dynamically selecting the most similar one based on both the source and target languages. This also influences responses to questions about cultural differences, reducing English-centric biases in non-English-centric models. This study deepens our understanding of language representation in non-English-centric LLMs, shedding light on the intricate dynamics of multilingual processing at the representational level.
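
The projection of intermediate-layer representations into the vocabulary space described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `latent_language_mass`, the toy unembedding matrix, the random hidden states, and the per-token language labels are all hypothetical, and a real model would typically apply its final normalization layer before the projection, which is omitted here.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def latent_language_mass(hidden_states, unembed, token_lang):
    """Return, for each layer, the probability mass assigned to each language.

    hidden_states: one vector of size d_model per layer (single token position).
    unembed: d_model rows of vocab-size lists (hypothetical output embedding).
    token_lang: language label of each vocabulary entry (assumed to be known).
    """
    vocab = len(token_lang)
    per_layer = []
    for h in hidden_states:
        # Project the layer's hidden state into the vocabulary space.
        logits = [
            sum(h[i] * unembed[i][v] for i in range(len(h)))
            for v in range(vocab)
        ]
        probs = softmax(logits)
        # Sum token probabilities by the language each token belongs to.
        mass = {}
        for p, lang in zip(probs, token_lang):
            mass[lang] = mass.get(lang, 0.0) + p
        per_layer.append(mass)
    return per_layer


# Toy data only: random numbers stand in for a real model's activations.
random.seed(0)
d_model, num_layers = 8, 4
token_lang = ["en", "en", "ja", "ja", "zh", "zh"]
unembed = [[random.gauss(0, 1) for _ in token_lang] for _ in range(d_model)]
hidden_states = [
    [random.gauss(0, 1) for _ in range(d_model)] for _ in range(num_layers)
]

layer_mass = latent_language_mass(hidden_states, unembed, token_lang)
for layer, mass in enumerate(layer_mass):
    dominant = max(mass, key=mass.get)
    print(layer, dominant, {k: round(v, 3) for k, v in mass.items()})
```

Under these assumptions, the language with the largest mass at a given layer plays the role of that layer's latent language; with a real model, the hidden states would come from a forward pass and the labels from a language-identification step over the vocabulary.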

Paper Type: Long

Research Area: Multilingualism and Cross-Lingual NLP

Research Area Keywords: Interpretability and Analysis of Models for NLP, Multilingualism and Cross-Lingual NLP, Ethics, Bias, and Fairness

Contribution Types: Model analysis & interpretability

Languages Studied: English, Japanese, Chinese, French, Arabic

Submission Number: 3798
