Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models

ACL ARR 2025 May Submission 4396 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Language confusion, where large language models (LLMs) generate text in a language the user did not request, remains a critical challenge, especially for English-centric models. We present the first mechanistic interpretability (MI) study of language confusion, combining behavioral benchmarking with neuron-level analysis. Using the Language Confusion Benchmark (LCB), we show that confusion points (CPs), the specific positions where language switches occur, are central to this phenomenon. Through layer-wise analysis with TunedLens and targeted neuron attribution, we reveal that transition failures in the final layers drive confusion. We further demonstrate that editing a small set of critical neurons, identified via comparative analysis with multilingual-tuned models, substantially mitigates confusion without harming general competence or fluency. Our approach matches multilingual alignment in confusion reduction for most languages and yields cleaner, higher-quality outputs. These findings provide new insight into the internal dynamics of LLMs and highlight neuron-level interventions as a promising direction for robust, interpretable multilingual language modeling. Code and data will be released upon publication.
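To make the neuron-editing idea concrete, here is a minimal sketch of suppressing a small set of MLP neurons at inference time using PyTorch forward hooks on a Hugging Face model. The model name and the (layer, neuron-index) pairs are illustrative placeholders, not the neurons the paper identifies, and the paper's selection procedure (comparative attribution against multilingual-tuned models) is not reproduced here.

```python
# Hedged sketch: zero out a handful of MLP neurons during generation.
# MODEL_NAME and CRITICAL_NEURONS are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder LLaMA-architecture model
CRITICAL_NEURONS = {20: [1024, 2048], 21: [512]}   # {layer: [MLP neuron indices]}, illustrative only

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model.eval()

def make_zeroing_hook(indices):
    """Return a forward hook that zeroes the selected activation channels."""
    def hook(module, inputs, output):
        output[..., indices] = 0.0  # suppress the chosen neurons in place
        return output
    return hook

handles = []
for layer_idx, indices in CRITICAL_NEURONS.items():
    # LLaMA-style module path: the MLP's gating activation has one channel per neuron.
    act = model.model.layers[layer_idx].mlp.act_fn
    handles.append(act.register_forward_hook(make_zeroing_hook(indices)))

prompt = "Responda em português: qual é a capital do Japão?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()  # remove the hooks to restore the unedited model
```

Because the edit is applied through hooks rather than weight changes, it can be toggled per request, which is one way the abstract's claim of mitigating confusion "without harming general competence" could be checked against unedited behavior.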
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: mixed language, multilingual evaluation, multilingual representation, language confusion, mechanistic interpretability
Contribution Types: Model analysis & interpretability
Languages Studied: Arabic, English, Portuguese, Turkish, Chinese, Spanish, French, Hindi, Russian, Japanese, Korean, German, Indonesian, Italian, Vietnamese
Submission Number: 4396