Harnessing Linguistic Dissimilarity for Zero-Shot Language Generalization on Low-Resource Varieties

ACL ARR 2025 May Submission7736 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Although a great deal of communication takes place in them, low-resource language varieties used in specific regions or by specific groups remain neglected in the development of multilingual language models. Much cross-lingual research focuses on inter-lingual transfer, which strives to align related varieties and suppress the linguistic differences between them. For low-resource varieties, however, linguistic dissimilarity is also an important cue that enables generalization to unseen varieties. Unlike prior approaches, we propose a two-stage Language Generalization framework that focuses on capturing variety-specific cues while also exploiting the rich overlap offered by a high-resource source variety. First, we propose TOPPing, a source-selection method specifically designed for low-resource varieties. Second, we propose a lightweight architecture, VAÇAÍ-Bowl, that learns variety-specific attributes with one branch while a parallel branch captures variety-invariant attributes via adversarial training. We evaluate our framework on dependency parsing as a proxy for performance on other downstream tasks. Together, the methods outperform all baselines across 10 low-resource varieties.
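The abstract does not specify how the adversarial, variety-invariant branch is trained; a common realization of this kind of two-branch design is a gradient reversal layer (GRL). The sketch below, in PyTorch, is an illustration under that assumption: all module names, dimensions, and the GRL choice itself are hypothetical, not the authors' actual VAÇAÍ-Bowl implementation.

```python
# Minimal sketch of a two-branch encoder in the spirit of the described design:
# one branch keeps variety-specific features, while a parallel branch is pushed
# toward variety-invariance by an adversary behind a gradient reversal layer.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class TwoBranchEncoder(nn.Module):
    def __init__(self, hidden_dim=768, num_varieties=10, grl_lambda=1.0):
        super().__init__()
        self.specific = nn.Linear(hidden_dim, hidden_dim)   # variety-specific branch
        self.invariant = nn.Linear(hidden_dim, hidden_dim)  # variety-invariant branch
        # The adversary tries to recover the variety label from the invariant
        # branch; the reversed gradient trains that branch to hide the signal.
        self.variety_clf = nn.Linear(hidden_dim, num_varieties)
        self.grl_lambda = grl_lambda

    def forward(self, h):
        spec = torch.relu(self.specific(h))
        inv = torch.relu(self.invariant(h))
        adv_logits = self.variety_clf(GradReverse.apply(inv, self.grl_lambda))
        # A downstream head (e.g., a biaffine dependency parser) would consume
        # the concatenation of both branches.
        return torch.cat([spec, inv], dim=-1), adv_logits
```

During training, a cross-entropy loss over adv_logits against the variety label would be added to the parsing loss; because of the reversed gradient, minimizing the adversary's loss drives the invariant branch toward features that transfer across varieties.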
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: multilingualism, efficient/low-resource methods for nlp, machine learning for nlp
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: Gheg, Paraguay Mbya Guarani, Brazil Mbya Guarani, Komi-Permyak, Komi-Zyrian, Ligurian, Central Alemannic, Skolt Sami, Low Saxon, Umbrian
Submission Number: 7736