Keywords: Multilingual Language Model, Natural Language Processing, Transliteration, Underrepresented Language Modeling
Abstract: While transfer learning from large pretrained multilingual language models has achieved impressive performance on natural language processing tasks for many languages, this transfer is limited by the unavailability of large corpora for most languages and by the barrier of differing scripts. Script differences force the tokens of two languages to remain disjoint at the input level. We therefore hypothesize that transliterating all languages to the same script can improve the performance of multilingual language models. The languages of South Asia and Southeast Asia present a unique opportunity for testing this hypothesis, as almost all of the major languages in this region have their own script; nevertheless, they can easily be transliterated to a single common representation. We validate our hypothesis empirically by pretraining ALBERT models on the Indo-Aryan languages available in the OSCAR corpus and measuring model performance on the Indo-Aryan subset of the IndicGLUE benchmark. Compared to the non-transliteration-based model, the transliteration-based model (termed XLM-Indic) shows significant improvement on almost all IndicGLUE tasks. For example, XLM-Indic performs better on News Classification (0.41%), Multiple Choice QA (4.62%), NER (6.66%), and Cloze-Style QA (3.32%). In addition, XLM-Indic establishes new SOTA results on most tasks of the IndicGLUE benchmark while remaining competitive on the rest. Across the IndicGLUE tasks, the most underrepresented languages appear to gain the most. For instance, on NER, XLM-Indic achieves 10%, 35%, and 58.5% better F1-scores on Gujarati, Panjabi, and Oriya, respectively, compared to the current SOTA.
One-sentence Summary: Transliteration is a simple technique for improving multilingual language model performance.
Supplementary Material: zip
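As a minimal illustration of the preprocessing idea described in the abstract (mapping text in multiple Indic scripts to one shared representation before tokenization and pretraining), here is a hedged Python sketch. The authors' exact tool and target scheme are not specified here; this example uses the open-source `indic_transliteration` package and the IAST romanization scheme purely as stand-ins, and the `to_common_script` helper and language codes are illustrative assumptions.

```python
# Illustrative sketch only: transliterate Indo-Aryan text from its native
# script to one shared Latin scheme, so a single tokenizer can see
# overlapping subwords across languages. Library and scheme are assumptions,
# not necessarily what the paper used.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

# Map each (hypothetical) language code to its native-script scheme.
SCHEMES = {
    "hi": sanscript.DEVANAGARI,   # Hindi
    "bn": sanscript.BENGALI,      # Bengali
    "pa": sanscript.GURMUKHI,     # Panjabi
    "gu": sanscript.GUJARATI,     # Gujarati
    "or": sanscript.ORIYA,        # Oriya
}

def to_common_script(text: str, lang: str) -> str:
    """Transliterate `text` from its native script to a shared Latin scheme."""
    return transliterate(text, SCHEMES[lang], sanscript.IAST)

# Sentences from different scripts end up in one common representation.
print(to_common_script("नमस्ते दुनिया", "hi"))
print(to_common_script("নমস্কার পৃথিবী", "bn"))
```

A corpus preprocessed this way can then be fed to a standard ALBERT pretraining pipeline unchanged, since only the input text representation differs.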