Keywords: artificial languages, learnability, language model, cross-lingual, linguistic theories, word order, transformer
Abstract: Why do some languages, like Czech, permit free word order, while others, like English, do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction between free-word-order languages (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain the cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts learnability. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.
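The abstract uses per-token surprisal as its learnability measure. As a hedged illustration only (not the paper's actual pipeline, which pretrains transformers), the sketch below computes mean surprisal in bits under a toy add-alpha bigram model; the function names and the smoothing choice are this sketch's own assumptions. The same comparison logic applies when the probabilities come from a trained language model instead.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigram and bigram frequencies over whitespace-tokenized sentences.

    Toy stand-in for a trained language model, for illustration only.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def mean_surprisal(sentence, unigrams, bigrams, alpha=1.0):
    """Average per-token surprisal, -log2 p(token | previous token), in bits.

    Uses add-alpha smoothing so unseen bigrams get nonzero probability.
    Higher mean surprisal is read as lower learnability.
    """
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split()
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        total += -math.log2(p)
    return total / (len(tokens) - 1)

# A model exposed only to one word order assigns higher surprisal
# (lower learnability) to a scrambled variant of the same sentence.
uni, bi = train_bigram(["the dog runs"] * 10)
canonical = mean_surprisal("the dog runs", uni, bi)
scrambled = mean_surprisal("runs the dog", uni, bi)
```

Here `canonical < scrambled`, mirroring the abstract's finding that word-order irregularity raises surprisal; in the paper this contrast is measured with pretrained transformers over synthetic word-order variants of full corpora.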
Paper Type: Long
Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics
Research Area Keywords: linguistic theories, cognitive modeling, computational psycholinguistics
Contribution Types: Model analysis & interpretability, Data analysis, Theory
Languages Studied: French, Portuguese, English, Swedish, Danish, Latvian, Czech, Hungarian, Estonian, Finnish
Submission Number: 1886