Abstract: Recent work on cross-lingual sentence representations has focused on contrastive learning as an alternative to pre-training on parallel sentences, which are scarce, especially for lower-resourced languages.
In this study, we assess the robustness of two contrastive learning strategies that use either transliteration or natural language inference (NLI) datasets to create positive and negative pairs. Instead of sentence matching, we evaluate representation quality on the more complex task of parallel sentence mining, using five language pairs involving low-resource (and endangered) languages: Lower Sorbian-German, Chuvash-Russian, Corsican-French, Mingrelian-Georgian, and Mingrelian-English.
We find that while NLI-based contrastive learning performs better overall and improves representation quality, in our experiments it remains effective mainly for language pairs that share a script or language family.
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingual representations, contrastive learning, less-resourced languages, multilingual evaluation
Languages Studied: Lower Sorbian, German, Chuvash, Russian, Corsican, French, Mingrelian, Georgian, English
Submission Number: 4560