What If Chinese Were Latinized? A Counterfactual Study of Script, Tokenization, and Language Modeling
Keywords: Tokenization, Chinese NLP, Orthography, Counterfactual Pretraining
Abstract: The Latinxua Sin Wenz movement of the 1920s to 1950s proposed replacing Chinese characters with a Latinized script.
Though the movement ultimately failed, it raises a compelling counterfactual for NLP:
what if Chinese had been written in Latin script all along?
We construct this alternative reality by converting Chinese corpora to pinyin with word boundaries and training superBPE tokenizers from scratch under three pinyin orthographic conventions (toneless, number-toned, and diacritic).
In the first of a planned series of experiments, we analyze the resulting tokenizer vocabularies along three axes:
vocabulary composition, cross-script token overlap, and homophone collision patterns.
We find that Latinized Chinese tokenizers exhibit fundamentally different vocabulary structures,
with substantial homophone collisions where multiple distinct Chinese words collapse into identical pinyin tokens.
Cross-script overlap analysis reveals that pinyin tokenizers generate thousands of tokens with no character-based equivalent, suggesting that Latin-script Chinese would inhabit a markedly different subword space.
These findings contribute to the ongoing debate on the "tokenization tax" imposed on non-Latin-script languages and illuminate how deeply script choice shapes the foundations of NLP systems.
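To make the homophone-collision idea concrete, the following is a minimal sketch rather than the pipeline used in this work. It assumes a tiny hand-written lexicon (`LEXICON`) of words with number-toned pinyin syllables, and the helper names (`strip_tone`, `to_diacritic`, `render`, `collisions`) are hypothetical. It renders each word under the three conventions named in the abstract and groups words that collapse to the same Latin string.

```python
from collections import defaultdict

# Toy lexicon (an assumption for illustration): word -> number-toned syllables.
LEXICON = {
    "事实": ["shi4", "shi2"],
    "适时": ["shi4", "shi2"],
    "实施": ["shi2", "shi1"],
    "时事": ["shi2", "shi4"],
    "逝世": ["shi4", "shi4"],
    "权利": ["quan2", "li4"],
    "权力": ["quan2", "li4"],
    "银行": ["yin2", "hang2"],
}

# Tone-marked forms of each vowel, indexed by tone number minus one.
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}


def strip_tone(syllable):
    """Drop the trailing tone digit: 'shi4' -> 'shi'."""
    return syllable.rstrip("12345")


def to_diacritic(syllable):
    """Convert 'shi4' -> 'shì' using the usual mark-placement rule."""
    base = strip_tone(syllable)
    tone = syllable[len(base):]
    if tone in ("", "5"):  # neutral tone carries no mark
        return base
    vowels = [i for i, ch in enumerate(base) if ch in TONE_MARKS]
    if not vowels:
        return base
    # Mark goes on 'a' or 'e' if present, on 'o' in 'ou', else the last vowel.
    if "a" in base:
        idx = base.index("a")
    elif "e" in base:
        idx = base.index("e")
    elif "ou" in base:
        idx = base.index("o")
    else:
        idx = vowels[-1]
    marked = TONE_MARKS[base[idx]][int(tone) - 1]
    return base[:idx] + marked + base[idx + 1:]


def render(syllables, convention):
    """Join a word's syllables into one string under the given convention."""
    if convention == "toneless":
        return "".join(strip_tone(s) for s in syllables)
    if convention == "number":
        return "".join(syllables)
    if convention == "diacritic":
        return "".join(to_diacritic(s) for s in syllables)
    raise ValueError(f"unknown convention: {convention}")


def collisions(lexicon, convention):
    """Map each Latin form shared by two or more words to those words."""
    groups = defaultdict(list)
    for word, syllables in lexicon.items():
        groups[render(syllables, convention)].append(word)
    return {form: words for form, words in groups.items() if len(words) > 1}


if __name__ == "__main__":
    for convention in ("toneless", "number", "diacritic"):
        print(convention, collisions(LEXICON, convention))
```

In this toy lexicon the toneless convention merges more words into a single form than either toned convention does, while some pairs (such as 权利 and 权力) remain indistinguishable even with tones. A real analysis would need a full pronunciation dictionary and corpus frequencies.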
Submission Number: 44