Phonetics Encode Language Identity More Than Grammar: Evidence from LOLO Typology Prediction

ACL ARR 2026 January Submission2102 Authors

01 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0

Keywords: computational typology, phonology–syntax interface, International Phonetic Alphabet (IPA), WALS, cross-linguistic generalization, language identification, adversarial representation learning

Abstract: We test whether phonetic surface form alone can predict grammatical typology across languages. Using verse-aligned parallel Bible translations in 14 typologically diverse languages, we convert each verse to an International Phonetic Alphabet (IPA) character sequence via Epitran and predict two language-level WALS features: basic word order (SOV/SVO/VSO) and gender system size (None/Two/Three/Many). Because labels are constant within a language, random verse splits leak language identity and substantially overestimate generalization; we therefore adopt leave-one-language-out (LOLO) evaluation as the primary protocol. Across character n-gram TF–IDF baselines, phonological and phonotactic features, BiLSTM and Transformer encoders, and a gradient-reversal adversarial objective to suppress language-ID cues, random splits yield near-perfect accuracy, consistent with memorization. Under LOLO, performance is modest and highly variable across held-out languages, and representation analyses show embeddings cluster by language identity (and correlated genealogical and areal effects) more strongly than by typology. We release a reproducible IPA pipeline and offer an evaluation caution: in a phonetics-only setting, IPA robustly encodes language identity, while transferable signal for broad grammatical typology is weak and sensitive to label coverage.
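The contrast between random verse splits and leave-one-language-out (LOLO) evaluation can be illustrated with a minimal sketch. This is not the authors' released pipeline: the function names, the three-language toy corpus, and the 50-verses-per-language count are all hypothetical, chosen only to show why a random split places the same language on both sides while a LOLO fold cannot.

```python
import random
from collections import defaultdict


def lolo_splits(languages):
    """Yield (held_out_language, train_indices, test_indices), one fold per language."""
    by_lang = defaultdict(list)
    for i, lang in enumerate(languages):
        by_lang[lang].append(i)
    for held_out in sorted(by_lang):
        test_idx = by_lang[held_out]
        train_idx = [i for lang, idxs in by_lang.items()
                     if lang != held_out for i in idxs]
        yield held_out, train_idx, test_idx


def random_split(n, test_fraction=0.2, seed=0):
    """Shuffle verse indices and split them, ignoring which language each verse is in."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def leaked_languages(languages, train_idx, test_idx):
    """Return the languages that occur on both sides of a split."""
    return {languages[i] for i in train_idx} & {languages[i] for i in test_idx}


# Hypothetical toy corpus: one language tag per verse, 3 languages x 50 verses.
languages = [lang for lang in ("tur", "spa", "swa") for _ in range(50)]

# Random verse split: every language in the test set was also seen in training.
train_idx, test_idx = random_split(len(languages))
print("random split leaks:", sorted(leaked_languages(languages, train_idx, test_idx)))

# LOLO: the held-out language never appears in the training fold.
for held_out, train_idx, test_idx in lolo_splits(languages):
    print(held_out, "leaks:", sorted(leaked_languages(languages, train_idx, test_idx)))
```

Because the typology labels are constant within a language, any language shared between train and test lets a classifier recover the label by recognizing the language rather than by generalizing, which is the failure mode the abstract attributes to random splits.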

Paper Type: Long

Research Area: Multilinguality and Language Diversity

Research Area Keywords: multilingual evaluation, cross-lingual transfer, multilingual representations

Contribution Types: Model analysis & interpretability, Data analysis, Position papers

Languages Studied: Arabic, French, Hindi, Indonesian, Japanese, Mandarin Chinese, Māori, Russian, Spanish, Swahili, Tagalog, Telugu, Turkish, Zulu

Submission Number: 2102
