Abstract: Transliteration has emerged as a powerful means of bridging the gap between languages in multilingual NLP, showing promising results on unseen languages regardless of script. While this success is widely attributed to transliteration producing a shared representational space across languages, we investigate how much three factors contribute to that space: shared script, overlapping token vocabularies, and shared phonology. To disentangle these factors, we train and evaluate models on three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as the original orthography. We evaluate on two downstream tasks, named entity recognition (NER) and natural language inference (NLI), with results largely consistent with our hypothesis: romanization is most effective because it induces all three kinds of sharing.
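To make the three transliteration schemes the abstract contrasts concrete, here is a minimal sketch. It assumes the `unidecode` and `epitran` libraries as stand-ins for whatever romanizer and phonemizer the paper actually uses, and the substitution cipher is a toy character permutation; none of this is the authors' pipeline.

```python
# Sketch of the three transliteration schemes contrasted in the abstract.
# Assumptions: `unidecode` stands in for the paper's romanizer and
# `epitran` for its phonemizer; the cipher is a toy illustration.
import random
import string

from unidecode import unidecode   # pip install unidecode
import epitran                    # pip install epitran


def romanize(text: str) -> str:
    """Map any script to a rough Latin rendering
    (shares script and some vocabulary across languages)."""
    return unidecode(text)


def phonemize(text: str, lang_script: str = "hin-Deva") -> str:
    """Map text to IPA phonemes via Epitran
    (shares phonology rather than orthography)."""
    return epitran.Epitran(lang_script).transliterate(text)


def substitution_cipher(text: str, seed: int = 0) -> str:
    """Deterministically permute the lowercase Latin alphabet:
    within-language token structure is preserved, but any shared
    script or vocabulary overlap with other languages is destroyed."""
    rng = random.Random(seed)
    shuffled = list(string.ascii_lowercase)
    rng.shuffle(shuffled)
    table = str.maketrans(string.ascii_lowercase, "".join(shuffled))
    return text.lower().translate(table)


if __name__ == "__main__":
    print(romanize("नमस्ते"))              # rough Latin rendering
    print(phonemize("नमस्ते"))             # IPA phoneme string
    print(substitution_cipher("namaste"))  # permuted Latin characters
```

Under this framing, romanization preserves all three kinds of sharing, phonemic transcription isolates phonology, and the cipher removes cross-lingual script and vocabulary overlap while keeping within-language structure intact.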
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingualism, cross-lingual transfer, multilingual representations, multilingual pre-training
Contribution Types: Model analysis & interpretability
Languages Studied: Swedish, Portuguese, Ligurian, Catalan, Romanian, Spanish, Polish, Albanian, Haitian, French, Serbian, Bengali, Hindi, Croatian, Oriya, Russian, Urdu, Iloko, Shona, Latvian, Uzbek, German, Finnish, Somali, Swahili, Amharic, Telugu, Thai, Georgian, Korean, Burmese, Turkish, Vietnamese
Submission Number: 6524