Code-Switching Is Not Noise: Evaluating LLMs on the Language People Actually Speak

ACL ARR 2026 May Submission17292 Authors

26 May 2026 (modified: 18 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0
Keywords: code-switching, multilingual evaluation, LLM-as-assistant evaluation, human-centered NLP, Hinglish, Spanglish, register, pragmatics, interactional fit
Abstract: For many multilingual users, code-switching is not degraded language but the ordinary communicative register of everyday interaction. Yet multilingual LLM evaluation often treats languages as separable monolingual conditions, while prior code-switching benchmarks primarily evaluate classification or comprehension tasks. We argue that code-switched assistant interaction is a first-class human-centered NLP evaluation setting. We introduce a diagnostic probe of 100 Hinglish and Spanglish prompt groups, each paired with English and local-language controls, and evaluate GPT-4o, Claude, Qwen2.5-72B, and Llama-3.3-70B on generative assistant responses. Responses are scored for task success, register preservation, pragmatic intent preservation, non-translation compliance, and naturalness, with binary failure labels for silent monolingualisation, register collapse, translation-over-assistance, pragmatic cue loss, and over-formalisation. Results show that code-switched prompts do not primarily break task completion: mean task success remains high (1.77/2), close to English controls (1.86/2). Instead, they break interactional fit: register preservation drops to 0.88/2 and naturalness to 0.94/2. Even GPT-4o and Claude silently monolingualise their responses to 38% and 34% of code-switched prompts, respectively. We argue that multilingual assistant evaluation must measure not only whether models understand users, but whether they respect how users actually speak.
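As an illustration of the scoring scheme the abstract describes, the following is a minimal sketch of how per-response annotations might be stored and aggregated into per-condition means and failure rates. It is not the authors' implementation: the names `ScoredResponse`, `summarise`, the condition strings, and the snake_case dimension keys are all hypothetical, and the integer 0/1/2 scale is an assumption inferred from the "/2" figures reported above.

```python
from dataclasses import dataclass, field
from statistics import mean

# Dimension and label names follow the abstract; the snake_case keys
# and everything else in this sketch are hypothetical.
SCORE_DIMS = (
    "task_success",
    "register_preservation",
    "pragmatic_intent_preservation",
    "non_translation_compliance",
    "naturalness",
)
FAILURE_LABELS = (
    "silent_monolingualisation",
    "register_collapse",
    "translation_over_assistance",
    "pragmatic_cue_loss",
    "over_formalisation",
)


@dataclass
class ScoredResponse:
    model: str
    condition: str  # assumed values: "code_switched", "english", "local"
    scores: dict    # dimension name -> 0, 1, or 2 (assumed integer scale)
    failures: set = field(default_factory=set)  # binary labels that fired

    def __post_init__(self):
        # Reject missing or out-of-range scores and unknown failure labels.
        for dim in SCORE_DIMS:
            if self.scores.get(dim) not in (0, 1, 2):
                raise ValueError(f"score for {dim!r} must be 0, 1, or 2")
        unknown = set(self.failures) - set(FAILURE_LABELS)
        if unknown:
            raise ValueError(f"unknown failure labels: {sorted(unknown)}")


def summarise(responses, condition):
    """Return (mean score per dimension, failure rate per label) for one condition."""
    rows = [r for r in responses if r.condition == condition]
    if not rows:
        raise ValueError(f"no responses for condition {condition!r}")
    means = {d: mean(r.scores[d] for r in rows) for d in SCORE_DIMS}
    rates = {
        label: sum(label in r.failures for r in rows) / len(rows)
        for label in FAILURE_LABELS
    }
    return means, rates
```

Calling `summarise(responses, "code_switched")` and `summarise(responses, "english")` on the same annotated set would give the kind of side-by-side comparison the abstract reports, for example a mean for `register_preservation` alongside a rate for `silent_monolingualisation`.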
Paper Type: Short
Research Area: Human-Centered NLP and Human-AI Interaction
Research Area Keywords: human-AI interaction, human-centered evaluation, multilingual evaluation, code-switching, sociolinguistics
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English, Hindi, Spanish (Hinglish and Spanglish code-switched varieties)
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 17292