Keywords: Token Frequency Imbalance, Vocabulary Size, Language Model Pre-training, Information Theory, Tokenization
TL;DR: A larger vocabulary lowers language modeling difficulty by making it easier for models to learn the non-i.i.d. patterns in text
Abstract: Large language models are trained with tokenizers that map text to a fixed vocabulary, yet the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favours ever-larger vocabularies, but it is unclear whether the benefit comes from better word segmentation or from amplifying this frequency skew. To disentangle these factors, we perform a controlled study that scales the vocabulary of a constant-size Transformer from 24K to 196K symbols while holding data, compute, and optimisation fixed. Above 24K, every common word is already a single token, so further growth only increases imbalance. Word-level loss decomposition shows that larger vocabularies reduce cross-entropy almost exclusively by lowering uncertainty on the ~$2,500$ most frequent words, even though loss on the rare tail rises. Because the same frequent words cover roughly $80\%$ of tokens in downstream benchmarks, this training advantage transfers intact. We further show that enlarging model parameters with a fixed tokenizer yields the same frequent-word benefit, revealing a shared mechanism behind vocabulary and model scaling. Our results recast “bigger vocabularies help” as “sharper frequency imbalance helps,” offering a simple, principled knob for tokenizer–model co-design and clarifying the loss dynamics that govern language-model scaling in pre-training.
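The word-level loss decomposition described above can be illustrated with a minimal sketch (not the authors' released code): per-token cross-entropy is summed back into per-word losses, words are ranked by corpus frequency, and losses are averaged separately for the frequent "head" (e.g. the top 2,500 words) and the rare tail. Function and variable names such as `decompose_word_loss`, `token_losses`, and `tokens_per_word` are illustrative assumptions.

```python
# Minimal sketch of a word-level loss decomposition under the assumptions
# stated above; the toy inputs at the bottom are made up for demonstration.
from collections import Counter
import numpy as np

def decompose_word_loss(words, token_losses, tokens_per_word, head_size=2500):
    """Aggregate token-level cross-entropy into head/tail word-frequency buckets.

    words           : list[str]   word at each word position in the corpus
    token_losses    : list[float] cross-entropy per token, flattened in order
    tokens_per_word : list[int]   number of tokens each word was split into
    head_size       : int         rank cutoff separating frequent words from the tail
    """
    # Rank words by corpus frequency (rank 0 = most frequent).
    freq = Counter(words)
    rank = {w: r for r, (w, _) in enumerate(freq.most_common())}

    # Sum token losses back into word-level losses.
    word_losses, i = [], 0
    for n in tokens_per_word:
        word_losses.append(sum(token_losses[i:i + n]))
        i += n

    head = [l for w, l in zip(words, word_losses) if rank[w] < head_size]
    tail = [l for w, l in zip(words, word_losses) if rank[w] >= head_size]
    return float(np.mean(head)), float(np.mean(tail)) if tail else float("nan")

# Toy usage: a rare word splits into many tokens and accumulates a large loss.
words = ["the", "cat", "sat", "the", "antidisestablishmentarianism"]
tokens_per_word = [1, 1, 1, 1, 6]
token_losses = [0.5, 2.0, 2.2, 0.4, 3.1, 2.9, 3.0, 2.8, 3.3, 3.2]
head_loss, tail_loss = decompose_word_loss(words, token_losses,
                                           tokens_per_word, head_size=3)
print(f"head (frequent) word loss: {head_loss:.2f}, tail word loss: {tail_loss:.2f}")
```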
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 25402