Token Distillation: Attention-Aware Input Embeddings for New Tokens

ICLR 2026 Conference Submission 20072 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: embedding initialization, tokenizer, vocabulary adaptation
TL;DR: We propose Token Distillation to obtain high-quality input embeddings for new tokens by distilling representations obtained using the original tokenization.
Abstract: Current language models rely on static vocabularies determined at pretraining time, which can lead to decreased performance and increased computational cost for domains underrepresented in the original vocabulary. Adding new tokens can address this problem, provided their embeddings are well initialized. However, existing embedding initialization methods require expensive further training or pretraining of additional modules. In this paper, we propose Token Distillation and show that by distilling representations obtained using the original tokenization, we can quickly learn high-quality input embeddings for new tokens. Experimental results with a wide range of open-weight models show that Token Distillation outperforms even strong baselines.
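A minimal, hypothetical sketch of the general idea described in the abstract (not the authors' implementation): the new token's input embedding is optimized so that the model's representation of the single new token matches the representation obtained when the same string is processed under the original subword tokenization. The model name, choice of target hidden state, mean-based warm start, learning rate, and step count below are illustrative assumptions.

```python
# Hypothetical sketch of distilling an input embedding for a new token
# from representations computed under the original tokenization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any open-weight causal LM with tied APIs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

new_token = "tokenization"  # string we want to represent as a single token
orig_ids = tokenizer(new_token, return_tensors="pt").input_ids

# Target representation: final hidden state of the last subword
# under the original (multi-token) tokenization.
with torch.no_grad():
    target = model(orig_ids, output_hidden_states=True).hidden_states[-1][0, -1]

# Register the new token and grow the embedding matrix accordingly.
tokenizer.add_tokens([new_token])
model.resize_token_embeddings(len(tokenizer))
new_id = tokenizer.convert_tokens_to_ids(new_token)
emb = model.get_input_embeddings()

# Warm start from the mean of the original subword embeddings (assumption).
with torch.no_grad():
    emb.weight[new_id] = emb.weight[orig_ids[0]].mean(dim=0)

# Optimize only the new embedding row to match the distillation target.
new_emb = emb.weight[new_id].detach().clone().requires_grad_(True)
opt = torch.optim.Adam([new_emb], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    inputs_embeds = new_emb.unsqueeze(0).unsqueeze(0)  # shape (1, 1, hidden)
    out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
    loss = torch.nn.functional.mse_loss(out.hidden_states[-1][0, -1], target)
    loss.backward()
    opt.step()

# Write the distilled embedding back into the model.
with torch.no_grad():
    emb.weight[new_id] = new_emb
```

This sketch distills against a single target state; the paper's attention-aware formulation and training details may differ.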
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 20072