TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data

ACL ARR 2024 June Submission 808 Authors

13 Jun 2024 (modified: 10 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Transliterating related languages that use different scripts into a common script is effective for improving cross-lingual transfer in downstream tasks. However, this methodology often makes pretraining a model from scratch unavoidable, as transliteration introduces new subwords not covered by existing multilingual pretrained language models (mPLMs). This is undesirable because it requires a large computation budget. A more promising approach is to make full use of available mPLMs. To this end, this paper proposes a simple but effective framework: $\textbf{Trans}$literate-$\textbf{M}$erge-$\textbf{I}$nitialize ($\textbf{TransMI}$). TransMI creates a strong baseline well-suited for data that is transliterated into a common script by exploiting an mPLM and its tokenizer. TransMI has three stages: ($\textbf{a}$) transliterate the vocabulary of an mPLM into a common script; ($\textbf{b}$) merge the new vocabulary with the original vocabulary; and ($\textbf{c}$) initialize the embeddings of the new subwords. We apply TransMI to three strong recent mPLMs. Our experiments demonstrate that TransMI not only preserves the mPLM's ability to handle non-transliterated data, but also enables it to effectively process transliterated data, thereby facilitating cross-lingual transfer. The results show consistent improvements of 3\% to 34\% across different mPLMs and tasks. We will make our code and models publicly available.
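The abstract's three stages map naturally onto standard tokenizer and embedding operations. The sketch below is a minimal, hypothetical illustration of that pipeline on a Hugging Face mPLM; the `transliterate` placeholder, the choice of checkpoint, and the copy-based embedding initialization are illustrative assumptions, not the authors' exact implementation (the paper's own merging and initialization strategies may differ).

```python
# Hypothetical sketch of the Transliterate-Merge-Initialize (TransMI) stages.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "xlm-roberta-base"  # any mPLM checkpoint; assumption for illustration


def transliterate(subword: str) -> str:
    """Placeholder: map a subword into the common script.
    In practice, replace with a real romanizer (e.g., Uroman)."""
    return subword.lower()  # stand-in only; real transliteration is script-dependent


tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

# (a) Transliterate: map every subword of the original vocabulary into the common script.
vocab = tokenizer.get_vocab()  # subword -> id
translit_map = {tok: transliterate(tok) for tok in vocab}

# (b) Merge: keep only transliterations that are genuinely new subwords and add them.
# Simplification: add_tokens() registers them as added tokens rather than performing
# a true merge of the underlying subword vocabulary.
new_set = {t for t in translit_map.values() if t not in vocab}
tokenizer.add_tokens(sorted(new_set))
model.resize_token_embeddings(len(tokenizer))

# (c) Initialize: give each new subword an embedding derived from its source subword.
# One simple choice is to copy the embedding of a source subword that transliterates
# to it (here, the last one encountered); averaging over all sources is another option.
emb = model.get_input_embeddings().weight.data
with torch.no_grad():
    for src, tgt in translit_map.items():
        if tgt in new_set:
            tgt_id = tokenizer.convert_tokens_to_ids(tgt)
            emb[tgt_id] = emb[vocab[src]].clone()
```

After these steps, the modified model can tokenize and embed both original-script and transliterated text without any pretraining from scratch, which is the setting the abstract describes.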
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: transliteration, multilingualism, multilingual evaluation, mPLMs, cross-lingual transfer
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: Our final modified models are evaluated on datasets that cover more than 300 languages.
Submission Number: 808