Abstract: Transliteration has emerged as a powerful means of bridging the gap between languages in multilingual NLP, yielding promising results on unseen languages regardless of script. While this success is widely attributed to transliteration producing a shared representational space across languages, we investigate how much three specific factors contribute to the performance of models relying on transliteration: shared script, overlapping token vocabularies, and shared phonology. To do so, we train and evaluate models on three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as the original orthography, using named entity recognition as the downstream evaluation task. Our results are largely consistent with our hypothesis: romanization is most effective because it yields all three kinds of sharing.
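The sketch below illustrates the three transliteration schemes named in the abstract. The library choices (unidecode as a stand-in for a romanizer, epitran for grapheme-to-phoneme conversion) and the cipher design are illustrative assumptions, not the paper's confirmed tooling.

```python
# A minimal sketch of the three transliteration schemes, assuming
# off-the-shelf tools: `pip install unidecode epitran`.
import random

from unidecode import unidecode
import epitran


def romanize(text: str) -> str:
    """Romanization: map any script to ASCII Latin characters,
    yielding shared script and (often) overlapping token vocabularies."""
    return unidecode(text)


def phonemize(text: str, lang_code: str = "hin-Deva") -> str:
    """Phonemic transcription: grapheme-to-phoneme conversion to IPA,
    yielding shared phonology. The language code is a hypothetical example."""
    return epitran.Epitran(lang_code).transliterate(text)


def substitution_cipher(text: str, seed: int = 0) -> str:
    """Substitution cipher: a fixed one-to-one remapping of the characters
    occurring in the text. Preserves distributional structure while
    removing any script or phonological overlap with other languages."""
    alphabet = sorted(set(text))
    shuffled = alphabet[:]
    random.Random(seed).shuffle(shuffled)  # deterministic given the seed
    table = dict(zip(alphabet, shuffled))
    return "".join(table[ch] for ch in text)


if __name__ == "__main__":
    print(romanize("नमस्ते"))              # e.g. "nmste"
    print(phonemize("नमस्ते"))             # e.g. "nəməste"
    print(substitution_cipher("hello"))    # e.g. "lhoo " (scrambled chars)
```

Under this framing, comparing models trained on each representation isolates which kind of sharing drives transfer: romanization provides all three, phonemic transcription primarily phonology, and the cipher none.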
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingualism, cross-lingual transfer, multilingual representations, multilingual pre-training
Contribution Types: Model analysis & interpretability
Languages Studied: Swedish, Portuguese, Ligurian, Catalan, Romanian, Spanish, Polish, Albanian, Haitian, French, Serbian, Bengali, Hindi, Croatian, Oriya, Russian, Urdu, Iloko, Shona, Latvian, Uzbek, German, Finnish, Somali, Swahili, Amharic, Telugu, Thai, Georgian, Korean, Burmese
Submission Number: 5948