MoVE: Mixture-of-Vocabulary-Experts for Improved Representation Learning

ICLR 2026 Conference Submission 9463 Authors

17 Sept 2025 (modified: 27 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: vocabulary scaling, mixture-of-experts, dense retrieval, encoder-only models
TL;DR: We propose Mixture-of-Vocabulary-Experts (MoVE) for scaling vocabularies to 500k tokens in encoder-only models without added inference cost. MoVE boosts sentence embeddings by reducing rare-token undertraining and surpasses models 3× deeper.
Abstract: Vocabulary size is a key design choice in transformers, with recent work showing that larger models benefit from larger vocabularies and achieve better performance at the same training cost. Expanding the output vocabulary is costly due to the softmax computation, whereas scaling the input vocabulary is relatively inexpensive since it only involves an embedding lookup. However, scaling the input vocabulary presents a significant challenge: a long tail of rare tokens in the training data receives fewer gradient updates, remains under-trained, and can adversely impact embedding quality. This is particularly problematic for encoder-only text embedding models, which depend on high-quality token embeddings to build document representations. We present the Mixture-of-Vocabulary-Experts (MoVE) architecture for training encoder-only models, which enables effective vocabulary scaling by combining tokens of a small base vocabulary into a much larger vocabulary via a Mixture-of-Experts layer. Each expert specializes in a subset of the expanded vocabulary and can generate the trained embeddings offline, incurring no additional overhead during inference. Strong empirical results on the MTEB benchmark show that MoVE trained with vocabulary sizes up to $500\text{k}$ consistently outperforms naive vocabulary scaling, comparable baselines, and $3\times$ deeper models with the base vocabulary, while maintaining lower inference latency.
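
The following is a minimal sketch, not the authors' implementation, of the idea as described in the abstract: each expanded-vocabulary token is decomposed into base-vocabulary tokens, and an expert MLP, selected here by a simple vocabulary-slice rule, maps the pooled base embeddings to the expanded token's embedding. All names, default sizes, the pooling choice, and the routing rule are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MoVEEmbedding(nn.Module):
    """Illustrative MoVE-style embedding: expanded tokens are built from base-token
    pieces, with one expert MLP per slice of the expanded vocabulary (assumed routing)."""

    def __init__(self, base_vocab=32_000, expanded_vocab=500_000,
                 dim=768, num_experts=8, pieces_per_token=4):
        super().__init__()
        self.base_emb = nn.Embedding(base_vocab, dim)  # small, well-trained base table
        # Fixed decomposition: expanded token id -> its base-token pieces.
        # In practice this would come from the tokenizer; random here for illustration.
        self.register_buffer(
            "pieces", torch.randint(0, base_vocab, (expanded_vocab, pieces_per_token))
        )
        # Each expert specializes in a contiguous slice of the expanded vocabulary.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.slice = expanded_vocab // num_experts

    def forward(self, token_ids):
        # token_ids: (batch, seq) of expanded-vocabulary ids.
        pooled = self.base_emb(self.pieces[token_ids]).mean(-2)  # pool base-piece embeddings
        out = torch.empty_like(pooled)
        expert_idx = (token_ids // self.slice).clamp_max(len(self.experts) - 1)
        for e, expert in enumerate(self.experts):  # route each token to its slice's expert
            mask = expert_idx == e
            if mask.any():
                out[mask] = expert(pooled[mask])
        return out

    @torch.no_grad()
    def export_table(self, chunk=8192):
        """Precompute the full expanded-vocabulary table offline, in chunks, so that
        inference reduces to a plain embedding lookup with no MoE overhead."""
        ids = torch.arange(self.pieces.size(0)).split(chunk)
        return torch.cat([self.forward(c.unsqueeze(0)).squeeze(0) for c in ids])
```

Under these assumptions, `export_table` reflects the abstract's claim that the experts can generate the trained embeddings offline, leaving inference cost identical to a standard embedding lookup.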
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 9463