Abstract: Transliteration has emerged as a powerful means of bridging the gap between languages in multilingual NLP, showing promising results on unseen languages regardless of script. While this success is widely attributed to transliteration producing a shared representational space across languages, we investigate how much three factors contribute to that space: shared script, overlapping token vocabularies, and shared phonology. To disentangle these factors, we train and evaluate models on three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as the original orthography. We evaluate on two downstream tasks, named entity recognition (NER) and natural language inference (NLI), with results largely consistent with our hypothesis: romanization is most effective because it induces all three kinds of sharing.
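To make the three transliteration schemes the abstract contrasts concrete, here is a minimal sketch. It assumes the `unidecode` and `epitran` libraries as stand-ins for whatever romanizer and phonemizer the paper actually uses, and the substitution cipher is a toy character permutation; none of this is the authors' pipeline.

```python
# Sketch of the three transliteration schemes contrasted in the abstract.
# Assumptions: `unidecode` stands in for the paper's romanizer and
# `epitran` for its phonemizer; the cipher is a toy illustration.
import random
import string

from unidecode import unidecode   # pip install unidecode
import epitran                    # pip install epitran


def romanize(text: str) -> str:
    """Map any script to a rough Latin rendering
    (shares script and some vocabulary across languages)."""
    return unidecode(text)


def phonemize(text: str, lang_script: str = "hin-Deva") -> str:
    """Map text to IPA phonemes via Epitran
    (shares phonology rather than orthography)."""
    return epitran.Epitran(lang_script).transliterate(text)


def substitution_cipher(text: str, seed: int = 0) -> str:
    """Deterministically permute the lowercase Latin alphabet:
    within-language token structure is preserved, but any shared
    script or vocabulary overlap with other languages is destroyed."""
    rng = random.Random(seed)
    shuffled = list(string.ascii_lowercase)
    rng.shuffle(shuffled)
    table = str.maketrans(string.ascii_lowercase, "".join(shuffled))
    return text.lower().translate(table)


if __name__ == "__main__":
    print(romanize("नमस्ते"))              # rough Latin rendering
    print(phonemize("नमस्ते"))             # IPA phoneme string
    print(substitution_cipher("namaste"))  # permuted Latin characters
```

Under this framing, romanization preserves all three kinds of sharing, phonemic transcription isolates phonology, and the cipher removes cross-lingual script and vocabulary overlap while keeping within-language structure intact.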
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingualism, cross-lingual transfer, multilingual representations, multilingual pre-training
Contribution Types: Model analysis & interpretability
Languages Studied: Swedish, Portuguese, Ligurian, Catalan, Romanian, Spanish, Polish, Albanian, Haitian, French, Serbian, Bengali, Hindi, Croatian, Oriya, Russian, Urdu, Iloko, Shona, Latvian, Uzbek, German, Finnish, Somali, Swahili, Amharic, Telugu, Thai, Georgian, Korean, Burmese, Turkish, Vietnamese
Submission Number: 6524