Abstract: Multilingual sentence encoders (MSEs) are commonly obtained by training multilingual language models to map sentences from different languages into a shared semantic space. As such, they are subject to the curse of multilinguality, a loss of monolingual representational accuracy due to parameter sharing. Another limitation of MSEs is the trade-off between monolingual and cross-lingual performance: training for cross-lingual alignment of sentence embeddings distorts the optimal monolingual structure of the semantic spaces of individual languages, harming the utility of sentence embeddings in monolingual tasks. Moreover, different cross-lingual tasks, such as cross-lingual semantic similarity and zero-shot transfer for sentence classification, may require different kinds of cross-lingual alignment training. In this work, we address both issues by means of modular training of sentence encoders. We first train language-specific monolingual modules to mitigate negative interference between languages (i.e., the curse). We then align all non-English sentence embeddings to the English ones by training cross-lingual alignment adapters, which prevents interference with the monolingual specialization from the first step. We train and merge two types of cross-lingual adapters to resolve the conflicting requirements of different cross-lingual tasks. Monolingual and cross-lingual results on semantic text similarity and relatedness, bitext mining, and sentence classification tasks show that our modular solution achieves better and more balanced performance across all tasks than full-parameter training of monolithic multilingual sentence encoders, especially benefiting low-resource languages.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingual representations, less-resourced languages
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English, German, French, Czech, Spanish, Arabic, Turkish, Italian, Dutch, Polish, Russian, Chinese, Korean, Azerbaijani, Kazakh, Kyrgyz, Uyghur, Uzbek, Amharic, Telugu, Marathi, Kinyarwanda, Hausa
Submission Number: 1769