Language Confusion and Multilingual Performance: A Case Study of Thai-Adapted Large Language Models

ACL ARR 2025 May Submission 2394 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: This paper investigates the code-switching problem between English and Thai in large language models (LLMs), in particular models adapted through continual pre-training (CPT) and models trained from the start on multilingual data, referred to as multilingual LLMs (MLLMs). We vary the language of several parts of the prompt, namely the task instruction, the context, and the requested output language, to examine how these language-variation settings affect code-switching in the responses of different model types. Our findings show that mismatches between the context and output languages cause significant performance degradation across all model types, and that the models achieve similar performance in monolingual settings, while MLLMs are more robust in cross-lingual settings. This suggests that, despite the high cost of multilingual training from scratch, MLLMs remain necessary for downstream tasks in languages other than English, since their multilingual capability exceeds that of CPT models and models trained without any multilingual interventions.
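As a minimal illustration of the experimental design sketched in the abstract, the snippet below enumerates the possible prompt-language configurations when the instruction, context, and requested output language are varied independently between English and Thai. All templates, names, and strings here are hypothetical, not taken from the paper itself.

```python
from itertools import product

LANGUAGES = ["en", "th"]  # English and Thai

# Hypothetical instruction templates; the paper's actual prompts are not reproduced here.
INSTRUCTIONS = {
    "en": "Answer the question based on the passage.",
    "th": "จงตอบคำถามโดยอ้างอิงจากบทความ",
}
OUTPUT_REQUESTS = {
    "en": "Respond in English.",
    "th": "Respond in Thai.",
}

def build_prompt(inst_lang: str, ctx_lang: str, out_lang: str,
                 passage: str, question: str) -> str:
    """Assemble a prompt whose instruction, context, and requested
    output language can each be set independently."""
    return (
        f"{INSTRUCTIONS[inst_lang]}\n\n"
        f"Passage ({ctx_lang}): {passage}\n\n"
        f"Question: {question}\n"
        f"{OUTPUT_REQUESTS[out_lang]}"
    )

# Enumerate all 2^3 = 8 language-variation settings and label each
# as monolingual (all parts match) or cross-lingual (any mismatch).
for inst, ctx, out in product(LANGUAGES, repeat=3):
    setting = "monolingual" if inst == ctx == out else "cross-lingual"
    print(f"instruction={inst} context={ctx} output={out} -> {setting}")
```

Under this framing, the abstract's central comparison is between the monolingual settings (where all model types perform similarly) and the cross-lingual ones, especially those where ctx_lang differs from out_lang.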
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Multilingualism and Cross-Lingual NLP
Contribution Types: Model analysis & interpretability
Languages Studied: English, Thai
Submission Number: 2394