Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment
Abstract: Multilingual pre-trained language models (mPLMs) have shown impressive performance on cross-lingual transfer tasks.
However, transfer performance is often hindered when a low-resource target language is written in a different script from the
high-resource source language, even when the two languages are related or share parts of their vocabularies.
Inspired by recent work that uses transliteration to address
this problem, we propose a transliteration-based
post-pretraining alignment (PPA) method
that aims to improve cross-lingual alignment between languages written in diverse scripts.
We select two areal language groups, $\textbf{Mediterranean-Amharic-Farsi}$ and $\textbf{South+East Asian Languages}$, whose languages
have mutually influenced one another but are written in different scripts. We apply our method to these language groups and conduct extensive experiments
on a spectrum of downstream tasks. The results show that,
after PPA, models consistently outperform the original
model in English-centric transfer (by up to 50\% on some tasks).
Moreover, when languages other than English
are used as transfer sources,
our method yields even larger improvements. We will make our code and models publicly available.
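To make the core preprocessing idea concrete, below is a minimal sketch of transliterating text from several scripts into a shared Latin representation, the kind of step a transliteration-based alignment method could build on. The `unidecode` package and the example sentences are illustrative assumptions only; the paper's actual transliteration tool and alignment objective are not specified here.

```python
# Illustrative sketch only: romanize text written in different scripts into a
# shared Latin representation. The `unidecode` package is a rough romanizer
# used here as a stand-in; it is NOT necessarily the tool used in the paper.
from unidecode import unidecode

examples = {
    "amh": "ሰላም ለዓለም",        # Amharic (Ge'ez script)
    "ell": "Καλημέρα κόσμε",    # Greek
    "fas": "سلام دنیا",          # Farsi (Perso-Arabic script)
    "tha": "สวัสดีชาวโลก",       # Thai
}

for lang, text in examples.items():
    # The romanized string could serve as a second "view" of the same
    # sentence during post-training alignment (a hypothetical usage).
    romanized = unidecode(text)
    print(f"{lang}: {text} -> {romanized}")
```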
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: cross-lingual transfer, less-resourced languages, multilingualism, multilingual evaluation, transliteration
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: ara, arb, ary, arz, fas, amh, ell, heb, tur, mlt, zho, lzh, yue, wuu, kor, lao, lhu, mya, bod, tha
Submission Number: 1181