Keywords: computational typology, phonology–syntax interface, International Phonetic Alphabet (IPA), WALS, cross-linguistic generalization, language identification, adversarial representation learning
Abstract: We test whether phonetic surface form alone can predict grammatical typology across languages. Using verse-aligned parallel Bible translations in 14 typologically diverse languages, we convert each verse to an International Phonetic Alphabet (IPA) character sequence via Epitran and predict two language-level WALS features: basic word order (SOV/SVO/VSO) and gender system size (None/Two/Three/Many). Because labels are constant within a language, random verse splits leak language identity and substantially overestimate generalization; we therefore adopt leave-one-language-out (LOLO) evaluation as the primary protocol. Across character n-gram TF–IDF baselines, phonological and phonotactic features, BiLSTM and Transformer encoders, and a gradient-reversal adversarial objective to suppress language-ID cues, random splits yield near-perfect accuracy, consistent with memorization. Under LOLO, performance is modest and highly variable across held-out languages, and representation analyses show embeddings cluster by language identity (and correlated genealogical and areal effects) more strongly than by typology. We release a reproducible IPA pipeline and offer an evaluation caution: in a phonetics-only setting, IPA robustly encodes language identity, while transferable signal for broad grammatical typology is weak and sensitive to label coverage.
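The leave-one-language-out (LOLO) protocol named in the abstract can be illustrated with a minimal sketch. This is not the authors' released pipeline: the function name `lolo_splits`, the variable `verse_langs`, and the toy language codes are hypothetical, chosen only to show how grouping verses by language keeps the held-out language entirely out of training.

```python
from collections import defaultdict


def lolo_splits(languages):
    """Yield (held_out, train_idx, test_idx), one fold per distinct language.

    `languages` holds one language code per verse. Every verse of the
    held-out language goes to the test fold, so no training example
    shares a language (and hence a constant WALS label source) with it.
    """
    by_lang = defaultdict(list)
    for i, lang in enumerate(languages):
        by_lang[lang].append(i)
    for held_out in sorted(by_lang):
        test_idx = by_lang[held_out]
        train_idx = sorted(
            i for lang, idxs in by_lang.items() if lang != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Toy data (hypothetical): six verses drawn from three languages.
verse_langs = ["tur", "tur", "spa", "spa", "jpn", "jpn"]
splits = list(lolo_splits(verse_langs))

for held_out, train_idx, test_idx in splits:
    train_langs = {verse_langs[i] for i in train_idx}
    # The held-out language never appears among the training verses.
    assert held_out not in train_langs
```

A random verse-level split of the same toy data would almost always place verses from every language in both folds, which is the leakage the abstract warns about: a model can then score well by recognizing the language rather than by learning any transferable typological cue.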
Paper Type: Long
Research Area: Multilinguality and Language Diversity
Research Area Keywords: multilingual evaluation, cross-lingual transfer, multilingual representations
Contribution Types: Model analysis & interpretability, Data analysis, Position papers
Languages Studied: Arabic, French, Hindi, Indonesian, Japanese, Mandarin Chinese, Māori, Russian, Spanish, Swahili, Tagalog, Telugu, Turkish, Zulu
Submission Number: 2102