DistillMoE: Multi-Faceted Knowledge Distillation for Cross-Tokenizer Embedding Models

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Cross-Tokenizer Knowledge Distillation, Embedding Models, Mixture of Experts
TL;DR: We propose DistillMoE, combining a lightweight Mixture-of-Experts for multi-faceted sequence-level distillation with DynamicCKA token alignment, enabling richer and more effective cross-tokenizer embedding transfer.
Abstract: Cross-tokenizer knowledge distillation (CTKD) for Large Language Model (LLM)-based embedding models presents significant challenges, primarily due to tokenizer mismatches and the limitations of traditional distillation frameworks in capturing the diverse semantic signals encoded by the teacher. We propose DistillMoE, a framework that addresses these challenges through a dual-level strategy. At the sequence level, DistillMoE employs a lightweight Mixture-of-Experts (MoE) module to distill sentence representations, where each expert specializes in a distinct semantic perspective: pointwise, contrastive, or pairwise. A trainable router assigns inputs to experts, letting each objective be optimized separately and thus enabling seamless integration of diverse losses without heavy tuning. At the token level, we introduce DynamicCKA to align teacher–student hidden states for fine-grained knowledge transfer. This refinement yields teacher-aware sentence embeddings, enabling the MoE to assign more informative expert weightings and enhance multi-faceted distillation. Empirically, when distilling state-of-the-art text embedding models (e.g., LLM2Vec, BGE-M3, Qwen3) into a compact BERT-base student, DistillMoE consistently outperforms prior CTKD baselines across multiple datasets. These results demonstrate the effectiveness of combining multi-perspective sequence-level distillation with token-level alignment to obtain compact yet high-fidelity embedding models.
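The abstract names two ingredients: CKA-based alignment of teacher and student hidden states, and a trainable router that weights per-expert distillation losses. A minimal NumPy sketch of both might look as follows. Note the function names `linear_cka` and `route_losses`, the routing-on-sentence-embedding formulation, and all shapes are illustrative assumptions, not the authors' implementation; DistillMoE's DynamicCKA presumably extends the plain linear CKA shown here.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two sets of
    row-aligned hidden states X (n, d1) and Y (n, d2).
    Returns a similarity in [0, 1]; 1 means identical up to
    an orthogonal transform and scaling."""
    # Center each representation across the n samples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

def route_losses(sentence_emb, W_router, expert_losses):
    """Softmax gate over experts computed from a sentence
    embedding (d,) via router weights W_router (d, n_experts);
    returns the gate-weighted sum of per-expert losses
    (e.g., pointwise, contrastive, pairwise)."""
    logits = sentence_emb @ W_router
    gates = np.exp(logits - logits.max())  # numerically stable softmax
    gates /= gates.sum()
    return float(gates @ expert_losses)
```

For intuition: `linear_cka(X, X)` is exactly 1, and an untrained (zero-initialized) router weights the three expert losses uniformly, so `route_losses` then reduces to their mean.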
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10860