Model-Aware Tokenizer Transfer

ICLR 2026 Conference Submission 24994 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models, Tokenizer transfer, Embedding initialization, Attention distillation, Model-aware adaptation, Multilingual NLP, Vocabulary adaptation, Low-resource languages, Mid-resource languages, Model-Aware Tokenizer Transfer, Attention Influence Modeling, Cross-Tokenizer Distillation
TL;DR: This paper introduces Model-Aware Tokenizer Transfer, a method that leverages inter-token communication patterns in attention layers to efficiently adapt pretrained language models to new tokenizers and recover performance across diverse languages.
Abstract: Large Language Models (LLMs) are trained to support an increasing number of languages, yet their predefined tokenizers remain a bottleneck for adapting models to lower-resource or distinct-script languages. Existing tokenizer transfer methods typically rely on semantic heuristics to initialize new embeddings, ignoring higher-layer model dynamics and limiting transfer quality. We propose Model-Aware Tokenizer Transfer (MATT), a method that incorporates model internals into the tokenizer transfer process. MATT introduces an Attention Influence Modeling (AIM) objective that distills inter-token communication patterns from a source model into a target model with a new tokenizer, providing an efficient warm-up before standard language modeling. Unlike approaches that focus solely on embedding similarity, MATT leverages attention behavior to guide embedding initialization and adaptation. Experiments across diverse linguistic settings show that MATT recovers a large fraction of the original model’s performance within a few GPU hours, outperforming heuristic baselines. These results demonstrate that incorporating model-level signals offers a practical and effective path toward robust tokenizer transfer in multilingual LLMs.
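The abstract does not spell out the exact form of the AIM objective, so the following is only a minimal sketch of one plausible instantiation: attention maps from the source model (original tokenizer) and the target model (new tokenizer) are pooled onto shared word spans so they become comparable, and the student is trained to match the teacher's inter-token communication pattern via a KL divergence. The function names (`aggregate_to_words`, `aim_loss`), the word-level alignment, and the choice of KL loss are illustrative assumptions, not the paper's definitive formulation.

```python
# Hedged sketch of an attention-influence distillation warm-up loss.
# Assumption: source and target tokenizations can both be mapped onto a
# shared set of word spans via a token-to-word index tensor.
import torch
import torch.nn.functional as F


def aggregate_to_words(attn, token_to_word, num_words):
    """Pool a (heads, seq, seq) attention map onto word spans shared by both
    tokenizations, then renormalize rows into distributions."""
    one_hot = F.one_hot(token_to_word, num_words).float()          # (seq, words)
    to_words = attn @ one_hot                                       # (heads, seq, words)
    word_attn = torch.einsum('sw,hsv->hwv', one_hot, to_words)      # (heads, words, words)
    return word_attn / word_attn.sum(-1, keepdim=True).clamp_min(1e-9)


def aim_loss(src_attn, tgt_attn, src_tok_to_word, tgt_tok_to_word, num_words):
    """KL divergence between word-level attention maps of the source model
    (teacher, original tokenizer) and the target model (student, new tokenizer)."""
    p = aggregate_to_words(src_attn, src_tok_to_word, num_words)    # teacher
    q = aggregate_to_words(tgt_attn, tgt_tok_to_word, num_words)    # student
    return F.kl_div(q.clamp_min(1e-9).log(), p, reduction='batchmean')


# Toy usage: 4 heads, 6 source tokens and 5 target tokens covering 3 words.
src_attn = torch.softmax(torch.randn(4, 6, 6), dim=-1)
tgt_attn = torch.softmax(torch.randn(4, 5, 5, requires_grad=True), dim=-1)
loss = aim_loss(src_attn, tgt_attn,
                src_tok_to_word=torch.tensor([0, 0, 1, 1, 2, 2]),
                tgt_tok_to_word=torch.tensor([0, 1, 1, 2, 2]),
                num_words=3)
loss.backward()  # in practice combined with the standard LM loss during warm-up
```

In this reading, the distillation term acts only as an efficient warm-up signal before standard language modeling resumes, which matches the abstract's description of AIM; how attention is aggregated across layers and heads in the actual method is not specified here and would need the full paper.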
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 24994