Keywords: artificial languages, learnability, language model, cross-lingual, linguistic theories, word order, transformer
Abstract: Why do some languages, like Czech, permit free word order, while others, like English, do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction between free-word-order languages (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain the cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts learnability. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.
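The abstract uses per-token surprisal as its learnability measure. As a hedged illustration only (not the paper's actual pipeline, which pretrains transformers), the sketch below computes mean surprisal in bits under a toy add-alpha bigram model; the function names and the smoothing choice are this sketch's own assumptions. The same comparison logic applies when the probabilities come from a trained language model instead.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigram and bigram frequencies over whitespace-tokenized sentences.

    Toy stand-in for a trained language model, for illustration only.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def mean_surprisal(sentence, unigrams, bigrams, alpha=1.0):
    """Average per-token surprisal, -log2 p(token | previous token), in bits.

    Uses add-alpha smoothing so unseen bigrams get nonzero probability.
    Higher mean surprisal is read as lower learnability.
    """
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split()
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        total += -math.log2(p)
    return total / (len(tokens) - 1)

# A model exposed only to one word order assigns higher surprisal
# (lower learnability) to a scrambled variant of the same sentence.
uni, bi = train_bigram(["the dog runs"] * 10)
canonical = mean_surprisal("the dog runs", uni, bi)
scrambled = mean_surprisal("runs the dog", uni, bi)
```

Here `canonical < scrambled`, mirroring the abstract's finding that word-order irregularity raises surprisal; in the paper this contrast is measured with pretrained transformers over synthetic word-order variants of full corpora.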
Paper Type: Long
Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics
Research Area Keywords: linguistic theories, cognitive modeling, computational psycholinguistics
Contribution Types: Model analysis & interpretability, Data analysis, Theory
Languages Studied: French, Portuguese, English, Swedish, Danish, Latvian, Czech, Hungarian, Estonian, Finnish
Submission Number: 1886