NeighXLM: Enhancing Cross-Lingual Transfer in Low-Resource Languages via Neighbor-Augmented Contrastive Pretraining

ACL ARR 2025 May Submission157 Authors

08 May 2025 (modified: 03 Jul 2025), License: CC BY 4.0
Abstract: Recent progress in multilingual pretraining has yielded strong performance on high-resource languages, but generalization to genuinely low-resource settings remains limited. While prior approaches have attempted to improve cross-lingual transfer through representation alignment or contrastive learning, they remain constrained by the scarcity of parallel data that could supply positive supervision in target languages. In this work, we introduce NeighXLM, a neighbor-augmented contrastive pretraining framework that enriches target-language supervision by mining semantic neighbors from unlabeled corpora. Without relying on human annotations or translation systems, NeighXLM exploits intra-language semantic relationships captured during pretraining to construct high-quality positive pairs. The approach is model-agnostic and can be integrated seamlessly into existing multilingual pipelines. Experiments on Swahili demonstrate the effectiveness of NeighXLM in improving cross-lingual retrieval and zero-shot transfer performance.
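To make the idea sketched in the abstract concrete, the snippet below is a minimal, illustrative sketch of neighbor-augmented contrastive supervision: embed a monolingual target-language corpus, take each sentence's nearest intra-language neighbor above a similarity threshold as a positive pair, and train with a standard InfoNCE objective using in-batch negatives. It is not the paper's implementation; the encoder, threshold, temperature, and all function names here are assumptions for illustration only.

```python
# Illustrative sketch only (not the authors' code): mine intra-language
# semantic neighbors as contrastive positives, then apply InfoNCE.
import torch
import torch.nn.functional as F

def mine_neighbors(embeddings: torch.Tensor, k: int = 1, min_sim: float = 0.7):
    """Return index pairs (i, j) where j is a top-k cosine neighbor of i.

    embeddings: (N, d) sentence embeddings from a multilingual encoder.
    Pairs below `min_sim` are discarded to keep only confident positives.
    (k and min_sim are illustrative hyperparameters.)
    """
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.T                       # (N, N) cosine similarity
    sim.fill_diagonal_(-1.0)            # exclude trivial self-matches
    top_sim, top_idx = sim.topk(k, dim=-1)
    pairs = []
    for i in range(z.size(0)):
        for s, j in zip(top_sim[i].tolist(), top_idx[i].tolist()):
            if s >= min_sim:
                pairs.append((i, j))
    return pairs

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.05):
    """InfoNCE over a batch: mined neighbors as positives, in-batch negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature      # (B, B); diagonal entries are positives
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    # Usage sketch: random tensors stand in for encoder outputs over an
    # unlabeled Swahili corpus; min_sim is lowered so random data yields pairs.
    torch.manual_seed(0)
    corpus_emb = torch.randn(128, 768)
    pairs = mine_neighbors(corpus_emb, k=1, min_sim=0.0)
    i, j = zip(*pairs[:32])             # one training batch of mined pairs
    loss = info_nce(corpus_emb[list(i)], corpus_emb[list(j)])
    print(f"mined {len(pairs)} pairs, InfoNCE loss = {loss.item():.3f}")
```

In an actual pipeline, the mined pairs would be fed back to fine-tune the encoder so that intra-language neighbors remain close, supplementing whatever parallel supervision exists for the target language.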
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: less-resourced languages; cross-lingual transfer; data augmentation; data-efficient training; NLP in resource-constrained settings; contrastive learning
Contribution Types: Approaches to low-resource settings
Languages Studied: Swahili, English, French, Russian, Arabic, Spanish, Japanese
Keywords: less-resourced languages, cross-lingual transfer, data augmentation, data-efficient training, NLP in resource-constrained settings, contrastive learning
Submission Number: 157