Code-Switching Is Not Noise: Evaluating LLMs on the Language People Actually Speak

ACL ARR 2026 May Submission 17292 Authors

26 May 2026 (modified: 18 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0

Keywords: code-switching, multilingual evaluation, LLM-as-assistant evaluation, human-centered NLP, Hinglish, Spanglish, register, pragmatics, interactional fit

Abstract: For many multilingual users, code-switching is not degraded language but the ordinary communicative register of everyday interaction. Yet multilingual LLM evaluation often treats languages as separable monolingual conditions, while prior code-switching benchmarks primarily evaluate classification or comprehension tasks. We argue that code-switched assistant interaction is a first-class human-centered NLP evaluation setting. We introduce a diagnostic probe of 100 Hinglish and Spanglish prompt groups, each paired with English and local-language controls, and evaluate GPT-4o, Claude, Qwen2.5-72B, and Llama-3.3-70B on generative assistant responses. Responses are scored on a 0-2 scale for task success, register preservation, pragmatic intent preservation, non-translation compliance, and naturalness, with binary failure labels for silent monolingualisation, register collapse, translation-over-assistance, pragmatic cue loss, and over-formalisation. Results show that code-switched prompts do not primarily break task completion: mean task success remains high (1.77/2), close to English controls (1.86/2). Instead, they break interactional fit: register preservation drops to 0.88/2 and naturalness to 0.94/2. Even GPT-4o and Claude silently monolingualise 38% and 34% of code-switched prompts, respectively. We argue that multilingual assistant evaluation must measure not only whether models understand users, but whether they respect how users actually speak.

Paper Type: Short

Research Area: Human-Centered NLP and Human-AI Interaction

Research Area Keywords: human-AI interaction, human-centered evaluation, multilingual evaluation, code-switching, sociolinguistics

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis

Languages Studied: English, Hindi, Spanish (Hinglish and Spanglish code-switched varieties)

EMNLP 2026 AI Reviewing Experiment: yes

Submission Number: 17292
