Phonemes to the Rescue: Multilingual Tokenization Based on International Phonetic Alphabet

ACL ARR 2026 January Submission10466 Authors

06 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: multilingual tokenization, phonetic tokenization, tokenization quality, cross-lingual tokenization fairness
Abstract: Multilingual language models often exhibit performance disparities across languages that can arise as early as the tokenization stage. Widely used subword tokenization approaches favor high-resource languages, and even tokenizer-free methods yield longer sequences for scripts with a higher bytes-per-character ratio. To address these shortcomings, we propose the International Phonetic Alphabet (IPA) as a language-agnostic input representation for multilingual tokenizers. IPA offers a compact symbol inventory, greater cross-lingual character overlap, and a more balanced bytes-per-character distribution across languages. We train matched pairs of text-based and IPA-based subword tokenizers across 24 languages and 14 scripts and show that IPA tokenizers consistently improve tokenization quality, especially for non-Latin scripts, and generalize more effectively to unseen languages and scripts.
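The pipeline the abstract describes — convert text to IPA, then train a subword vocabulary on the phoneme strings — can be sketched in miniature. This is a toy illustration, not the authors' implementation: the grapheme-to-IPA table below is hand-made and English-ish (real pipelines use grapheme-to-phoneme tools), and the BPE trainer is a minimal pure-Python version.

```python
# Toy sketch: grapheme-to-IPA conversion followed by BPE training on
# the resulting phoneme strings. The G2P table is a hypothetical,
# hand-made mapping for illustration only.
from collections import Counter

G2P = {
    "sh": "ʃ", "th": "θ", "ch": "tʃ",
    "a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
    "p": "p", "t": "t", "k": "k", "s": "s",
    "n": "n", "m": "m", "r": "r", "l": "l",
}

def to_ipa(word: str) -> str:
    """Greedy longest-match grapheme-to-phoneme conversion."""
    out, i = [], 0
    while i < len(word):
        for span in (2, 1):  # try digraphs before single letters
            seg = word[i:i + span]
            if seg in G2P:
                out.append(G2P[seg])
                i += span
                break
        else:
            out.append(word[i])  # pass unknown characters through
            i += 1
    return "".join(out)

def train_bpe(words, num_merges):
    """Minimal BPE: repeatedly merge the most frequent symbol pair."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        for w in corpus:  # apply the merge in place
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

ipa_words = [to_ipa(w) for w in ["ship", "shin", "thin", "chip"]]
print(ipa_words)                 # ['ʃip', 'ʃin', 'θin', 'tʃip']
print(train_bpe(ipa_words, 2))   # merges learned over IPA symbols
```

Note how "ship" and "chip" share the symbol ʃ after conversion, so merges learned on IPA can be reused across words (and, in the paper's setting, across languages) whose spellings share no characters at all.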
Paper Type: Long
Research Area: Multilinguality and Language Diversity
Research Area Keywords: multilingual representations, multilingual pre-training, multilingual evaluation, less-resourced languages, cross-lingual transfer, subword representations, phonology, grapheme-to-phoneme conversion
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: Arabic, Amharic, Burmese, Chinese, English, Finnish, French, German, Greek, Hebrew, Hindi, Italian, Japanese, Korean, Lao, Persian, Polish, Russian, Serbian, Spanish, Swahili, Thai, Turkish, Urdu
Submission Number: 10466