TL;DR: We decouple the input and output vocabularies and scale up the input vocabulary, achieving the performance of double-sized models at no extra cost.
Abstract: Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.
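To make the idea concrete, below is a minimal sketch (not the authors' implementation) of a decoupled-vocabulary embedding layer: ordinary 1-gram token embeddings are augmented with embeddings of 2-gram tokens drawn from a much larger input table, while the output head keeps the original, smaller vocabulary. The class name, the hashing of 2-grams into a fixed-size table, and all sizes are illustrative assumptions; the reported log-linear trend would correspond to training loss decreasing roughly linearly in the logarithm of the input vocabulary size.

    # Minimal sketch, assuming a PyTorch-style setup; names and the
    # n-gram hashing scheme are illustrative, not the paper's method.
    import torch
    import torch.nn as nn

    class OverTokenizedEmbedding(nn.Module):
        def __init__(self, base_vocab: int, ngram_vocab: int, d_model: int):
            super().__init__()
            self.unigram = nn.Embedding(base_vocab, d_model)   # standard token embedding
            self.bigram = nn.Embedding(ngram_vocab, d_model)   # large multi-gram input table
            self.ngram_vocab = ngram_vocab

        def forward(self, token_ids: torch.LongTensor) -> torch.Tensor:
            # token_ids: (batch, seq_len) ids under the base tokenizer
            uni = self.unigram(token_ids)
            # Build 2-gram ids from each token and its predecessor, then hash
            # them into the large n-gram table (hashing is an assumption here).
            prev = torch.roll(token_ids, shifts=1, dims=1)
            prev[:, 0] = 0                                      # no predecessor at position 0
            bigram_ids = (prev * 1_000_003 + token_ids) % self.ngram_vocab
            return uni + self.bigram(bigram_ids)

    # The output (softmax) head is unchanged, e.g. nn.Linear(d_model, base_vocab),
    # so only the input side pays for the enlarged vocabulary.

Because the extra table is only looked up at the input, the added parameters contribute essentially no extra compute per token, which is how a larger input vocabulary can match bigger baselines at no additional cost.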
Lay Summary: Large language models (LLMs) like ChatGPT rely on a step called tokenization, which breaks text into small pieces drawn from a vocabulary before the model can understand and generate language. But we still don’t fully understand how this step affects a model’s performance as it grows larger.
In our research, we decouple the input and output vocabularies, allowing the model to process a more diverse and fine-grained vocabulary at the input. By experimenting with different vocabulary sizes, we found that a larger input vocabulary consistently improves performance, even without making the model itself any bigger.
This discovery shows that careful choices in how we break language down into tokens can unlock more power from existing models. Our work provides new insights for building the next generation of LLMs.
Primary Area: Deep Learning->Large Language Models
Keywords: LLM, Tokenization, Pre-training, Scaling-laws
Submission Number: 1456