Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic

ACL ARR 2026 January Submission10582 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: tokenization, vocabulary, embeddings, linear representations, morphology, multilingual, language modeling
Abstract: Large language models (LLMs) encode word-form variation (e.g., *walk* vs *walked*) as linear directions in embedding space. However, standard tokenization algorithms treat these variations as distinct tokens—filling the size-capped vocabulary with surface-form variants (e.g., *walk*, *walk**ing***, ***W**alk*) at the expense of less frequent words and multilingual coverage. We show that many of these variations can be captured by *transformation vectors*—additive offsets that, when applied to the *base* word embedding, yield the inflected form's representation—in both the input and output spaces. Building on this, we propose a compact reshaping of the vocabulary: rather than assigning unique tokens to each surface form, we compose them from shared *base* and *transformation* vectors (e.g., *walked* = *walk* + *past tense*). We apply our approach to multiple LLMs across five languages, removing up to 10% of vocabulary entries—thereby freeing space to allocate new, more diverse tokens. Importantly, we do so while also expanding vocabulary coverage to out-of-vocabulary words, with minimal impact on downstream performance, keeping the pretrained backbone frozen and training only lightweight adaptation modules. Our findings motivate a foundational rethinking of vocabulary design, moving from string enumeration to a compositional vocabulary that leverages the underlying structure of language.
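The core idea—composing an inflected form's embedding as a shared base vector plus a shared transformation vector—can be sketched with toy NumPy arrays. This is a minimal illustration of the arithmetic only; the embeddings here are random stand-ins, not real LLM embedding rows, and `compose` is a hypothetical helper, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension

# Hypothetical base-word embeddings (stand-ins for rows of an
# LLM's input embedding matrix).
base = {"walk": rng.normal(size=d), "jump": rng.normal(size=d)}

# One shared "past tense" transformation vector: an additive offset
# reused across all base words, instead of a separate token per form.
past_tense = rng.normal(size=d)

def compose(word: str, transform: np.ndarray) -> np.ndarray:
    """Embedding for a surface form = base embedding + transformation."""
    return base[word] + transform

walked = compose("walk", past_tense)
jumped = compose("jump", past_tense)

# Because the offset is shared, the difference between any composed
# form and its base recovers the same transformation vector.
assert np.allclose(walked - base["walk"], jumped - base["jump"])
```

Sharing one transformation vector across many base words is what frees vocabulary slots: surface forms no longer need their own embedding rows.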
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: pre-training, morphological inflection, multilingual, language modeling, linear representations
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches to low-compute settings / efficiency
Languages Studied: English, Russian, German, Spanish, Arabic
Submission Number: 10582