Abstract: Transliteration has emerged as a powerful means of bridging the gap between languages in multilingual NLP, yielding promising results on unseen languages regardless of script. While this success is widely attributed to transliteration producing a shared representational space across languages, we investigate how much three specific factors contribute to the performance of models relying on transliteration: shared script, overlapping token vocabularies, and shared phonology. To do so, we train and evaluate models on three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as the original orthography, using named entity recognition as the downstream evaluation task. Our results are largely consistent with our hypothesis: romanization is most effective because it yields all three kinds of sharing.
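The sketch below illustrates the three transliteration schemes named in the abstract. The library choices (unidecode as a stand-in for a romanizer, epitran for grapheme-to-phoneme conversion) and the cipher design are illustrative assumptions, not the paper's confirmed tooling.

```python
# A minimal sketch of the three transliteration schemes, assuming
# off-the-shelf tools: `pip install unidecode epitran`.
import random

from unidecode import unidecode
import epitran


def romanize(text: str) -> str:
    """Romanization: map any script to ASCII Latin characters,
    yielding shared script and (often) overlapping token vocabularies."""
    return unidecode(text)


def phonemize(text: str, lang_code: str = "hin-Deva") -> str:
    """Phonemic transcription: grapheme-to-phoneme conversion to IPA,
    yielding shared phonology. The language code is a hypothetical example."""
    return epitran.Epitran(lang_code).transliterate(text)


def substitution_cipher(text: str, seed: int = 0) -> str:
    """Substitution cipher: a fixed one-to-one remapping of the characters
    occurring in the text. Preserves distributional structure while
    removing any script or phonological overlap with other languages."""
    alphabet = sorted(set(text))
    shuffled = alphabet[:]
    random.Random(seed).shuffle(shuffled)  # deterministic given the seed
    table = dict(zip(alphabet, shuffled))
    return "".join(table[ch] for ch in text)


if __name__ == "__main__":
    print(romanize("नमस्ते"))              # e.g. "nmste"
    print(phonemize("नमस्ते"))             # e.g. "nəməste"
    print(substitution_cipher("hello"))    # e.g. "lhoo " (scrambled chars)
```

Under this framing, comparing models trained on each representation isolates which kind of sharing drives transfer: romanization provides all three, phonemic transcription primarily phonology, and the cipher none.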
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingualism, cross-lingual transfer, multilingual representations, multilingual pre-training
Contribution Types: Model analysis & interpretability
Languages Studied: Swedish, Portuguese, Ligurian, Catalan, Romanian, Spanish, Polish, Albanian, Haitian, French, Serbian, Bengali, Hindi, Croatian, Oriya, Russian, Urdu, Iloko, Shona, Latvian, Uzbek, German, Finnish, Somali, Swahili, Amharic, Telugu, Thai, Georgian, Korean, Burmese
Submission Number: 5948