Beyond Bouba & Kiki: Does Sound Symbolism scale across 27 Languages?

ACL ARR 2025 May Submission7109 Authors

20 May 2025 (modified: 03 Jul 2025), ACL ARR 2025 May Submission, CC BY 4.0
Abstract: This paper investigates whether phonemes consistently convey size-related meaning across languages, a phenomenon known as sound symbolism. We compile a typologically diverse dataset of 810 adjectives (30 per language across 27 languages and 13 families), each phonemically transcribed and validated using native speaker recordings. Using bag-of-phoneme vectors and baseline classifiers, we show that size semantics can be predicted from phonological features with statistically significant accuracy, even across unrelated languages. Surprisingly, consonants such as /q/ and /ɧ/ emerge as highly predictive, challenging prior work that emphasizes vowel symbolism. To separate symbolic patterns from language-specific cues, we introduce an adversarial model that penalizes language prediction while preserving size-related information. Under the adversarial setup, however, classification accuracy drops to near chance, suggesting that much of the symbolic signal may be entangled with language-specific structure, or that larger datasets may be needed to detect more subtle cross-linguistic patterns.
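For readers unfamiliar with the term, a bag-of-phoneme vector can be sketched as a simple count vector over a phoneme inventory, with phoneme order discarded. The snippet below is only an illustrative sketch under assumptions: the function names `build_inventory` and `bag_of_phonemes`, the pre-segmented input format, and the three toy words are hypothetical and are not taken from the paper's dataset or code.

```python
from collections import Counter


def build_inventory(transcriptions):
    """Return a sorted list of every phoneme seen in the transcriptions."""
    return sorted({p for phones in transcriptions for p in phones})


def bag_of_phonemes(phones, inventory):
    """Count vector with one slot per inventory phoneme; order is discarded."""
    counts = Counter(phones)
    return [counts[p] for p in inventory]


# Hypothetical toy transcriptions, already segmented into phonemes.
# These are illustrative only and are not the paper's data.
words = {
    "pequeño": ["p", "e", "k", "e", "ɲ", "o"],  # Spanish, "small"
    "petit": ["p", "ə", "t", "i"],              # French, "small"
    "grande": ["g", "r", "a", "n", "d", "e"],   # Italian, "big"
}

inventory = build_inventory(words.values())
vectors = {w: bag_of_phonemes(p, inventory) for w, p in words.items()}

for word, vec in vectors.items():
    print(word, vec)
```

Vectors of this kind could then be passed to any standard classifier together with a size label per word; which classifiers the authors actually use is not stated in the abstract beyond "baseline classifiers".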
Paper Type: Long
Research Area: Phonology, Morphology and Word Segmentation
Research Area Keywords: Phonology, pronunciation modeling, grapheme-to-phoneme conversion
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: Azerbaijani, Bulgarian, Czech, Danish, Dutch, French, Georgian, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Maltese, Mandarin, Marathi, Polish, Portuguese, Romanian, Russian, Spanish, Arabic, German, Tamil, Turkish, Twi, Ukrainian
Submission Number: 7109