Harnessing Linguistic Dissimilarity for Language Generalization on Unseen Low-Resource Varieties

Published: 14 Dec 2025, Last Modified: 14 Dec 2025, LM4UC@AAAI2026, CC BY 4.0
Keywords: dialects, varieties, language generalization, multilingualism, cross-lingual
TL;DR: We address the neglect of low-resource language varieties by introducing a two-stage Language Generalization framework for unseen varieties. In contrast to alignment-based methods, it focuses on language-specific cues.
Abstract: Although much communication takes place in them, low-resource language varieties used in specific regions or by specific groups remain neglected in the development of Multilingual Language Models. Much cross-lingual research focuses on inter-lingual transfer, which strives to align related varieties and minimize the differences between them. For low-resource varieties, however, linguistic dissimilarity is also an important cue that enables generalization to unseen varieties. Unlike prior approaches, we propose a two-stage Language Generalization framework that captures variety-specific cues while also exploiting the rich overlap offered by a high-resource source variety. First, we propose TOPPing, a source-selection method designed specifically for low-resource varieties. Second, we propose VAÇAÍ-Bowl, a lightweight architecture in which one branch learns variety-specific attributes while a parallel branch captures variety-invariant attributes through adversarial training. We evaluate our framework on structural prediction tasks as a proxy for performance on other downstream tasks. Using VAÇAÍ-Bowl with TOPPing yields an average improvement of 54.62% on dependency parsing across 10 low-resource varieties.
Submission Number: 26