Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases

ACL ARR 2024 December Submission1614 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: This study analyzes the ability of Large Language Models (LLMs) to simulate non-native second-language (L2) English speakers whose English is subject to interference from their native first language (L1). Specifically, we analyze L1-dependent language interference in L2 (English) dialogues simulated by LLMs through prompting. Our proposed L1-interference evaluation framework targets diverse linguistic features, such as reference-word usage and syntactic constructions that are adopted unusually frequently or infrequently under L1 bias (due to, e.g., avoidance behaviors), which are identified through distributional density comparisons using information-theoretic metrics. Our results demonstrate that LLMs can generally emulate the L1-dependent linguistic biases reflected in L2 dialogues. The impact of the native language varies: for example, L1s such as Japanese, Korean, and Mandarin significantly affect tense agreement, Urdu influences noun-verb collocations, and Thai shapes the use of numerals and modifiers, and these patterns agree with real human L2 data. These insights reveal the potential of LLMs for generating diverse L2 dialogues and offer a theoretical framework for evaluating LLM-generated L2 dialogue.
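The abstract does not name the specific information-theoretic metric used for the distributional density comparisons; as a minimal illustrative sketch, one common choice for comparing two discrete feature-usage distributions (e.g., syntactic-construction frequencies from two L1 groups) is the Jensen-Shannon divergence. The function names below are hypothetical, not from the paper:

```python
import math
from collections import Counter

def feature_dist(labels):
    """Normalized frequency distribution over feature labels
    (e.g., syntactic-construction tags extracted from dialogues)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete
    distributions given as {label: probability} dicts.
    Symmetric, bounded in [0, 1]; 0 means identical usage."""
    keys = set(p) | set(q)
    pa = {k: p.get(k, 0.0) for k in keys}
    qa = {k: q.get(k, 0.0) for k in keys}
    # Mixture distribution M = (P + Q) / 2
    m = {k: 0.5 * (pa[k] + qa[k]) for k in keys}

    def kl(a, b):
        # KL(A || B); terms with a[k] == 0 contribute nothing
        return sum(a[k] * math.log2(a[k] / b[k]) for k in keys if a[k] > 0)

    return 0.5 * kl(pa, m) + 0.5 * kl(qa, m)

# Hypothetical example: tense-marking labels from two simulated L1 groups
group_a = feature_dist(["past", "past", "present", "past"])
group_b = feature_dist(["present", "present", "past", "present"])
divergence = js_divergence(group_a, group_b)
```

A larger divergence between an L1 group's distribution and a native-English baseline would indicate a stronger L1-dependent bias on that feature, in the spirit of the framework the abstract describes.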
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: language resources; second language dialogues; cross-lingual transfer; evaluation methodologies; statistical testing for evaluation; LLM generation dialogue evaluation
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: English Second Language; Japanese; Korean; Urdu; Thai; Malay; Cantonese; Mandarin
Submission Number: 1614