Abstract: Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings -- learnable vectors assigned to individual languages.
However, this places a significant burden on token representations to encode all language-specific information, which may hinder language neutrality.
To address this limitation, we propose $\textbf{Lang}$uage-$\textbf{S}$cript $\textbf{A}$ware $\textbf{M}$ultilingual $\textbf{P}$retraining ($\textbf{LangSAMP}$), a method that incorporates both $\textbf{language}$ and $\textbf{script}$ embeddings to enhance representation learning.
Specifically, we integrate these embeddings into the output of the Transformer blocks before passing the final representations to the language modeling head for prediction.
We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages.
The resulting model consistently outperforms the baseline in zero-shot crosslingual transfer across diverse downstream tasks.
Extensive analysis reveals that the language and script embeddings capture language- and script-specific nuances, which in turn yields more language-neutral token representations, as evidenced by improved pairwise cosine similarity across languages.
In a case study, we further show that language and script embeddings can be used to select better source languages for crosslingual transfer.
We make our code and models publicly available.
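To make the mechanism concrete, the following is a minimal PyTorch sketch, not the authors' released code, of how language and script embeddings can be added to the final Transformer hidden states before the language modeling head; all names (e.g., LangSAMPHead, lang_embeddings, script_embeddings) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LangSAMPHead(nn.Module):
    """Sketch: inject language and script embeddings into the output of the
    Transformer blocks right before the language modeling head."""

    def __init__(self, hidden_size, vocab_size, num_languages, num_scripts):
        super().__init__()
        # One learnable vector per language and per script (hypothetical sizes/names).
        self.lang_embeddings = nn.Embedding(num_languages, hidden_size)
        self.script_embeddings = nn.Embedding(num_scripts, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states, lang_id, script_id):
        # hidden_states: (batch, seq_len, hidden_size) from the last Transformer block
        # lang_id, script_id: (batch,) indices of the input language and its script
        lang_vec = self.lang_embeddings(lang_id).unsqueeze(1)       # (batch, 1, hidden)
        script_vec = self.script_embeddings(script_id).unsqueeze(1)  # (batch, 1, hidden)
        # Add language/script information just before prediction, so token
        # representations themselves carry less language-specific signal.
        enriched = hidden_states + lang_vec + script_vec
        return self.lm_head(enriched)  # logits over the vocabulary
```

One presumable advantage of injecting the embeddings only on the output side is that the encoder itself is left unchanged, so the extra embeddings need not be used during downstream fine-tuning.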
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingual pre-training, cross-lingual transfer, multilingual representations
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: We consider a wide range of languages (more than 500) from a diverse set of language families in our study.
Submission Number: 402