Which English Do LLMs Prefer? Quantifying American and British English Through a Postcolonial Lens

ICLR 2026 Conference Submission22199 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: large language models, pretraining corpora, tokenization, generative preferences, postcolonial lens, modeling bias, dialectal inclusivity
TL;DR: We provide the first systematic study of how LLMs privilege American over British English across data, tokenization, and generation, exposing modeling bias rooted in historical and sociopolitical asymmetries through a postcolonial lens.
Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, yet they expose only a limited set of language settings, most notably “English (US)”, despite the colonial history and global diversity of English. We frame dialectal asymmetries through a postcolonial lens, showing that they emerge not only as downstream failures but as structural artifacts of the LLM development pipeline itself. Using a curated lexicon of 1,813 American–British variants, we triangulate evidence across three stages: (i) audits of six major pretraining corpora reveal systematic skew toward American English, (ii) tokenizer analyses demonstrate that British forms incur higher segmentation costs, and (iii) generative evaluations with our proposed DiAlign metric show consistent preference for American variants. This constitutes the first systematic examination of dialectal asymmetries between standard English varieties within LLMs. We find that these models exhibit modeling bias that privileges American English as the de facto norm, shaped by geopolitical histories of data curation and linguistic standardization. Our study raises concerns about linguistic homogenization, epistemic injustice, and inequity in global AI deployment, while offering practical guidance for developing more dialectally inclusive language technologies.
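The segmentation-cost claim in stage (ii) can be illustrated with a toy byte-pair-encoding (BPE) sketch. This is not the paper's method or data: the one-word "corpus" and the merge count below are invented for illustration. The point is only that a BPE vocabulary trained on an American-skewed corpus merges the American spelling into fewer tokens than its British variant.

```python
from collections import Counter

def apply_merge(pieces, pair):
    """Replace every adjacent occurrence of `pair` with the fused token."""
    out, i = [], 0
    while i < len(pieces):
        if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == pair:
            out.append(pieces[i] + pieces[i + 1])
            i += 2
        else:
            out.append(pieces[i])
            i += 1
    return out

def train_bpe(words, num_merges):
    """Learn greedy pair merges from a word list (minimal BPE trainer)."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            new_vocab[tuple(apply_merge(list(word), best))] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    pieces = list(word)
    for pair in merges:
        pieces = apply_merge(pieces, pair)
    return pieces

# Hypothetical American-skewed "corpus": only the US spelling is seen.
merges = train_bpe(["color"] * 10, 4)
print(len(segment("color", merges)), len(segment("colour", merges)))  # 1 3
```

With four merges learned from the US spelling, "color" collapses to a single token while "colour" still splits into three pieces; the paper's tokenizer analyses measure this kind of asymmetry at scale over real LLM vocabularies.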
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22199