Overcoming Vocabulary Mismatch: Vocabulary-agnostic Teacher Guided Language Modeling

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), enabling guidance from any teacher model during student model pretraining, regardless of vocabulary mismatch.
Abstract: Using large teacher models to guide the training of smaller student models has become the prevailing paradigm for efficient and effective learning. However, vocabulary mismatches between teacher and student language models pose significant challenges in language modeling, resulting in divergent token sequences and output distributions. To overcome these limitations, we propose Vocabulary-agnostic Teacher Guided Language Modeling (VocAgnoLM), a novel approach that bridges the gap caused by vocabulary mismatch through two key methods: (1) Token-level Lexical Alignment, which aligns token sequences across mismatched vocabularies, and (2) Teacher Guided Loss, which leverages the teacher model's loss to guide effective student training. We demonstrate its effectiveness in language modeling with a 1B student model guided by various 7B teacher models with different vocabularies. Notably, with Qwen2.5-Math-Instruct, a teacher model sharing only about 6% of its vocabulary with TinyLlama, VocAgnoLM achieves a 46% performance improvement compared to naive continual pretraining. Furthermore, we demonstrate that VocAgnoLM consistently benefits from stronger teacher models, providing a robust solution to vocabulary mismatches in language modeling.
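To make the Token-level Lexical Alignment idea concrete, the sketch below (not the authors' implementation; tokenizations and helper names are illustrative) aligns two different segmentations of the same text by overlapping character spans, so each student token can be mapped to the teacher tokens covering the same characters:

```python
def char_spans(tokens):
    """Return (start, end) character offsets for tokens that concatenate back to the text."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans

def align_student_to_teacher(student_tokens, teacher_tokens):
    """Map each student token index to the teacher token indices whose character spans overlap it."""
    s_spans, t_spans = char_spans(student_tokens), char_spans(teacher_tokens)
    return {
        i: [j for j, (ts, te) in enumerate(t_spans) if max(ss, ts) < min(se, te)]
        for i, (ss, se) in enumerate(s_spans)
    }

# Example: the same string segmented differently by two vocabularies.
student = ["solve", " the", " equa", "tion"]
teacher = ["solve", " the ", "equation"]
print(align_student_to_teacher(student, teacher))
# {0: [0], 1: [1], 2: [1, 2], 3: [2]}
```

Given such an alignment, a teacher-guided loss could, for instance, aggregate the teacher's per-token loss over each student token's aligned teacher tokens and use it to weight the student's training signal; the exact weighting scheme used by VocAgnoLM is described in the paper.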
Lay Summary: Large language models (LLMs) often pass down their knowledge to smaller models through a process called distillation. However, even as more powerful LLMs emerge, this process remains challenging, mainly because different models often use different vocabularies. If the models "speak" in different units, like using different sets of words, it is hard for one to teach the other effectively. VocAgnoLM introduces a new way to overcome this challenge. Instead of relying on shared vocabularies, it uses character-level alignment to extract and transfer meaningful information. This allows smaller models to learn effectively even when their vocabulary barely overlaps with that of the teacher model. For example, when a large model demonstrates how to solve a math problem, VocAgnoLM helps the smaller model understand and rephrase it using its own language system. Experiments show that VocAgnoLM outperforms conventional distillation methods, even in cases where the teacher and student models share very little vocabulary. This research helps bridge the gap between models built on different vocabularies and supports the development of smaller, more efficient AI systems.
Primary Area: Deep Learning->Large Language Models
Keywords: Knowledge Distillation, Cross-Tokenizer, Vocabulary-agnostic Pretraining, Teacher Guided Language Modeling
Submission Number: 10011