Parallel Sentence Mining Without Parallel Sentences: Scaling to Low-resource Languages

ACL ARR 2026 January Submission 1619 Authors

30 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · License: CC BY 4.0
Keywords: parallel sentence mining, low-resource languages, multilingual representation
Abstract: Parallel sentence mining aims to find translation pairs in comparable monolingual corpora, creating valuable datasets for downstream tasks. Yet it has mostly been studied for high-resource language pairs, and efforts for low-resource languages still require a significant number of parallel sentences to succeed, even though such sentences remain challenging to gather. Recent work on multilingual sentence representation has focused on three techniques that could address this data constraint: enhancing the isotropy of the embeddings, contrastive learning, and knowledge distillation. In this study, we therefore assess the robustness of these three techniques in a low-resource context. We use only monolingual data in the low-resource language, or parallel sentences in related but different languages, to obtain a sentence representation. We extend and create a benchmark covering sixteen language pairs with eight low-resource languages from three families. While all three methods improve representation quality by tackling the underlying cross-lingual misalignment, monolingual pre-training and language proximity are essential factors for better performance. We show a significant increase in mining quality, even for the most difficult language pairs.
Paper Type: Long
Research Area: Multilinguality and Language Diversity
Research Area Keywords: multilingualism, multilingual representations, less-resourced languages
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: Occitan, Upper Sorbian, Lower Sorbian, Chuvash, Corsican, Mingrelian, Gilaki, Mazandarani, Spanish, German, Russian, French, Georgian, Persian, English
Submission Number: 1619