Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment
Abstract: Multilingual pre-trained language models (mPLMs) have shown impressive performance on cross-lingual transfer tasks.
However, transfer performance is often hindered when a low-resource target language is written in a different script from the
high-resource source language, even when the two languages are related or share parts of their vocabularies.
Inspired by recent work that uses transliteration to address
this problem, we propose a transliteration-based
post-pretraining alignment (PPA) method
that aims to improve cross-lingual alignment between languages written in diverse scripts.
We select two areal language groups, $\textbf{Mediterranean-Amharic-Farsi}$ and $\textbf{South+East Asian Languages}$, whose languages
have mutually influenced one another but are written in different scripts. We apply our method to these language groups and conduct extensive experiments
on a spectrum of downstream tasks. The results show that,
after PPA, models consistently outperform the original
model in English-centric transfer (by up to 50\% on some tasks).
Moreover, when languages other than English
are used as transfer sources,
our method yields even larger improvements. We will make our code and models publicly available.
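To make the core preprocessing idea concrete, below is a minimal sketch of transliterating text from several scripts into a shared Latin representation, the kind of step a transliteration-based alignment method could build on. The `unidecode` package and the example sentences are illustrative assumptions only; the paper's actual transliteration tool and alignment objective are not specified here.

```python
# Illustrative sketch only: romanize text written in different scripts into a
# shared Latin representation. The `unidecode` package is a rough romanizer
# used here as a stand-in; it is NOT necessarily the tool used in the paper.
from unidecode import unidecode

examples = {
    "amh": "ሰላም ለዓለም",        # Amharic (Ge'ez script)
    "ell": "Καλημέρα κόσμε",    # Greek
    "fas": "سلام دنیا",          # Farsi (Perso-Arabic script)
    "tha": "สวัสดีชาวโลก",       # Thai
}

for lang, text in examples.items():
    # The romanized string could serve as a second "view" of the same
    # sentence during post-training alignment (a hypothetical usage).
    romanized = unidecode(text)
    print(f"{lang}: {text} -> {romanized}")
```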
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: cross-lingual transfer, less-resourced languages, multilingualism, multilingual evaluation, transliteration
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: ara, arb, ary, arz, fas, amh, ell, heb, tur, mlt, zho, lzh, yue, wuu, kor, lao, lhu, mya, bod, tha
Submission Number: 1181