Language Confusion and Multilingual Performance: A Case Study of Thai-Adapted Large Language Models

ACL ARR 2025 February Submission1595 Authors

14 Feb 2025 (modified: 09 May 2025) · License: CC BY 4.0
Abstract: This paper investigates the code-switching problem between English and Thai in large language models (LLMs), particularly those that have undergone continual pre-training (CPT) and those trained on multilingual data from the start, referred to as multilingual LLMs (MLLMs). We vary the language of the task instruction, the context, and the requested output in the prompt to examine how these language settings affect code-switching in the responses of different model types. Our findings show that mismatches between the context and output languages cause significant performance degradation across all model types; all model types perform comparably in monolingual settings, while MLLMs show improvements in cross-lingual settings. This suggests that, despite the high cost of multilingual training from scratch, MLLMs may still be needed for downstream tasks in languages other than English, since their multilingual capability exceeds that of CPT models and of models trained without any multilingual interventions.
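As a minimal, hypothetical sketch (not the authors' code), the eight language-variation settings described in the abstract could be enumerated as below; the instruction and output-request strings are illustrative assumptions, not the paper's actual prompts.

    import itertools

    # Hypothetical sketch of the prompt-language variation described above:
    # the languages of the task instruction, the context, and the requested
    # output are varied independently (en/th), giving eight settings from
    # fully monolingual to fully cross-lingual.

    INSTRUCTIONS = {
        "en": "Answer the question based on the context.",
        "th": "จงตอบคำถามโดยอ้างอิงจากบริบทที่ให้มา",
    }
    OUTPUT_REQUESTS = {
        "en": "Respond in English.",
        "th": "จงตอบเป็นภาษาไทย",
    }
    # Placeholder contexts in each language.
    CONTEXTS = {"en": "<English passage>", "th": "<Thai passage>"}

    def build_prompt(instr_lang, ctx_lang, out_lang):
        """Assemble one prompt variant for a given language setting."""
        return (
            f"{INSTRUCTIONS[instr_lang]}\n"
            f"Context: {CONTEXTS[ctx_lang]}\n"
            f"{OUTPUT_REQUESTS[out_lang]}"
        )

    for instr, ctx, out in itertools.product(["en", "th"], repeat=3):
        prompt = build_prompt(instr, ctx, out)
        print(f"instr={instr}, context={ctx}, output={out}:")
        print(prompt, "\n")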
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: code-switching, language confusion, multilingual large language models
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings
Languages Studied: English, Thai
Submission Number: 1595