Keywords: parallel sentence mining, low-resource languages, multilingual representation
Abstract: Parallel sentence mining aims to find translation pairs in comparable monolingual corpora, creating valuable datasets for downstream tasks.
Yet, it has mostly been studied for high-resource language pairs, and efforts on low-resource languages still require a substantial number of parallel sentences to succeed, although such sentences remain difficult to gather.
Recent work on multilingual sentence representation has focused on three techniques that could address this data constraint: enhancing the isotropy of the embeddings, using contrastive learning, and knowledge distillation.
In this study, we hence assess the robustness of these three techniques in a low-resource context, using only monolingual data in the low-resource language, or parallel sentences in related but different languages, to obtain sentence representations.
We extend and create a benchmark to cover sixteen language pairs with eight low-resource languages from three families.
While all three methods improve representation quality by tackling the underlying cross-lingual misalignment, monolingual pre-training and language proximity prove to be the essential factors behind better performance.
We show a significant increase in mining quality, even in the most difficult language pairs.
Paper Type: Long
Research Area: Multilinguality and Language Diversity
Research Area Keywords: multilingualism, multilingual representations, less-resourced languages
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: Occitan, Upper Sorbian, Lower Sorbian, Chuvash, Corsican, Mingrelian, Gilaki, Mazandarani, Spanish, German, Russian, French, Georgian, Persian, English
Submission Number: 1619