VEXMLM: Vocabulary Expansion for Multilingual Models to Address Tokenization and OOV Challenges in Underrepresented Languages
Abstract: Multilingual models have shown effectiveness in natural language processing (NLP) tasks, but their performance often declines for low-resource languages due to a predominant focus on high-resource languages during training. This leads to challenges such as out-of-vocabulary (OOV) tokens and over-segmentation, mainly resulting from English-centric tokenization methods. Vocabulary expansion using target-language tokens is a common strategy to address these problems. However, existing research mainly focuses on high-resource settings and overlooks the potential of vocabulary expansion to address OOV and over-segmentation in low-resource languages. To fill this gap, we introduce VEXMLM, an enhanced version of XLM-R optimized for low-resource languages through effective vocabulary expansion. Our approach involves creating a human-annotated benchmark dataset, training a language-specific tokenizer that maintains semantic coherence and incorporates morphological insights to build comprehensive vocabularies, and integrating these tokens into the model via embedding initialization. VEXMLM is evaluated on 19 African languages with varying scripts and resource availability across four tasks: Question Answering, Named Entity Recognition, Sentiment Analysis, and Educational Quality Classification. Comparative experiments demonstrate that VEXMLM significantly outperforms the baseline models, XLM-R and Glot500, on low-resource languages while also improving performance for high-resource languages.
The model, code, and dataset will be publicly available for research.
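The embedding-initialization step mentioned in the abstract can be illustrated with a minimal sketch. This is a hypothetical example, not the paper's actual implementation: it assumes each new target-language token is initialized as the mean of the embeddings of the subword pieces the original tokenizer splits it into (one common choice for vocabulary expansion; the paper's exact scheme may differ). All token names and the `segment` function below are illustrative placeholders.

```python
import numpy as np

# Toy "original" vocabulary and embedding matrix (V_old x d).
rng = np.random.default_rng(0)
old_vocab = {"_ke": 0, "bero": 1, "<unk>": 2}
old_emb = rng.normal(size=(len(old_vocab), 8))

def expand_vocabulary(old_vocab, old_emb, new_tokens, segment):
    """Append new tokens to the vocabulary; initialize each new
    embedding as the mean of the embeddings of its subword pieces
    under the original tokenizer (mean pooling, an assumption here)."""
    vocab = dict(old_vocab)
    rows = [old_emb]
    for tok in new_tokens:
        pieces = segment(tok)  # how the old tokenizer over-segments tok
        ids = [vocab.get(p, vocab["<unk>"]) for p in pieces]
        rows.append(old_emb[ids].mean(axis=0, keepdims=True))
        vocab[tok] = len(vocab)
    return vocab, np.vstack(rows)

# A whole-word token that the old tokenizer split into two pieces.
vocab, emb = expand_vocabulary(
    old_vocab, old_emb, ["_kebero"], segment=lambda t: ["_ke", "bero"]
)
```

In a real setting, the segmentation would come from the pretrained tokenizer and the expanded matrix would replace the model's input (and, for tied weights, output) embedding layer before continued pretraining.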
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: Vocabulary expansion, Low-resource languages, Out-of-vocabulary (OOV), Over-segmentation
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models, Data resources
Languages Studied: Algerian Arabic (arq), Moroccan Arabic/Darija (ary), Hausa (hau), Tigrinya (tir), Gurage, Afar, Harari, Ge'ez, Swahili (swa), Oromo (orm), Xitsonga (tso), Twi (twi), Luganda, Yoruba (yor), Igbo (ibo), Kinyarwanda (kin), Luo, Naija Pidgin, Wolof, Mozambican Portuguese (pt-MZ).
Submission Number: 92