Abstract: Continual pre-training has long been the default strategy for adapting models to non-English languages, but it struggles with initializing new embeddings, particularly for non-Latin scripts. In this work, we propose EnerGIZAr, a novel methodology that improves continual pre-training by leveraging statistical word alignment techniques. Our approach uses GIZA++ to construct a subword-level alignment matrix between source (English) and target language tokens. This matrix enables informed initialization of target tokenizer embeddings, which provides a more effective starting point for adaptation. We evaluate EnerGIZAr against state-of-the-art initialization strategies such as OFA and FOCUS across four typologically diverse languages: Hindi, Basque, Arabic, and Korean. Experimental results on key NLP tasks -- including POS tagging, sentiment analysis, NLI, and NER -- demonstrate that EnerGIZAr achieves superior monolingual performance while also outperforming all methods in cross-lingual transfer on XNLI. With EnerGIZAr, we propose an intuitive, explainable, and state-of-the-art initialization technique for the continual pre-training of English models.
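The abstract describes the core mechanism only at a high level: a GIZA++-derived subword alignment matrix is used to initialize the target tokenizer's embeddings from the English embeddings. The sketch below is a minimal, illustrative realization of that idea, not the paper's actual implementation; the function name, the assumption that alignments are aggregated into a (target-vocab x source-vocab) count matrix, and the alignment-weighted averaging with a mean-embedding fallback are all assumptions introduced here for illustration.

import numpy as np

def init_target_embeddings(source_emb, alignment, fallback=None):
    """Illustrative initialization of target-tokenizer embeddings.

    source_emb : (V_src, d) source (English) embedding matrix.
    alignment  : (V_tgt, V_src) matrix of subword alignment counts or
                 probabilities, e.g. aggregated from GIZA++ word alignments
                 (the aggregation step itself is assumed, not shown).
    fallback   : (d,) vector for target tokens with no alignment;
                 defaults to the mean source embedding.
    """
    source_emb = np.asarray(source_emb, dtype=float)
    alignment = np.asarray(alignment, dtype=float)
    if fallback is None:
        fallback = source_emb.mean(axis=0)

    # Row-normalize so each target token's row becomes a distribution
    # over the source subwords it aligns to.
    row_sums = alignment.sum(axis=1, keepdims=True)
    has_alignment = row_sums[:, 0] > 0
    weights = np.divide(alignment, row_sums,
                        out=np.zeros_like(alignment), where=row_sums > 0)

    # Aligned target tokens: alignment-weighted average of source embeddings.
    target_emb = weights @ source_emb
    # Unaligned target tokens: fall back to the mean source embedding.
    target_emb[~has_alignment] = fallback
    return target_emb

The resulting matrix would then replace the embedding layer of the source model before continual pre-training on the target language; how EnerGIZAr handles normalization, fallbacks, and subword-level aggregation in practice is specified in the paper itself, not here.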
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: low-resource language modelling, tokenizer initialisation, cross-lingual NLP
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: Hindi, Basque, Arabic, Korean
Submission Number: 2330