LinguAlchemy: Fusing Typological and Geographical Elements for Unseen Language Generalization

ACL ARR 2024 April Submission 877 Authors

16 Apr 2024 (modified: 10 May 2024), ACL ARR 2024 April Submission, CC BY 4.0
Abstract: Pretrained language models (PLMs) have become remarkably adept at task and language generalization. Nonetheless, they often fail dramatically when faced with unseen languages, posing a significant problem for diversity and equal access to PLM technology. In this work, we present LinguAlchemy, a regularization technique that incorporates typological, geographical, and phylogenetic aspects of languages, constraining the resulting PLM representations to better characterize the corresponding linguistic constraints. LinguAlchemy significantly improves the accuracy of mBERT and XLM-R on unseen languages by ~18% and ~2%, respectively, compared to fully fine-tuned models, displaying a high degree of unseen language generalization. We further introduce AlchemyScale and AlchemyTune, extensions of LinguAlchemy that adjust the linguistic regularization weights automatically, alleviating the need for hyperparameter search. LinguAlchemy enables better cross-lingual generalization to unseen languages, which is vital for better inclusivity and accessibility of PLMs.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingual, low resource
Contribution Types: Approaches to low-resource settings
Languages Studied: Afrikaans, Albanian, Amharic, Arabic, Armenian, Azeri, Bengali, Burmese, Catalan, Chinese, Danish, Dutch, English, Farsi, Finnish, French, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Igbo, Indonesian, Italian, Japanese, Javanese, Kannada, Khmer, Kinyarwanda, Korean, Latvian, Lingala, Luganda, Malay, Malayalam, Marathi, Mongolian, Moroccan Arabic (ary), Nigerian Pidgin (pcm), Norwegian, Oromo (orm), Polish, Portuguese, Punjabi (pan), Romanian, Russian, Rundi (run), Shona (sna), Slovenian, Somali (som), Spanish, Swahili (swa), Swedish, Tagalog, Tamil, Telugu, Thai, Tigrinya (tir), Turkish, Urdu, Vietnamese, Welsh, Xhosa (xho), Yoruba (yor).
Submission Number: 877
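
The abstract describes LinguAlchemy as a regularization technique that constrains PLM representations using linguistic information, but it does not state the form of the loss. The following is a minimal sketch of one way such a regularizer could look, assuming a squared-error penalty between a projected model representation and a per-language linguistic feature vector. The function names, the penalty form, the vector dimensions, and all numeric values are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: the abstract does not specify the loss, so the
# squared-error penalty and every name below are assumptions.

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def regularized_loss(task_loss, projected_repr, linguistic_vector, reg_weight):
    """Task loss plus a weighted penalty that pulls the projected model
    representation toward a linguistic feature vector (hypothetical form)."""
    return task_loss + reg_weight * mse(projected_repr, linguistic_vector)


# Hypothetical values: a 4-dimensional linguistic feature vector for one
# language, and a projection of the model's sentence representation into
# the same space.
linguistic_vector = [1.0, 0.0, 1.0, 0.0]
projected_repr = [0.5, 0.0, 1.0, 0.5]

loss = regularized_loss(
    task_loss=0.75,
    projected_repr=projected_repr,
    linguistic_vector=linguistic_vector,
    reg_weight=2.0,
)
print(loss)
```

In this sketch `reg_weight` is the kind of hyperparameter that, according to the abstract, AlchemyScale and AlchemyTune adjust automatically; how they do so is not described here and is not modeled.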