Keywords: Tokenization, BPE, LLM, Corpus, Hierarchy
Abstract: Tokenization is a critical component of large language models, yet standard Byte Pair Encoding (BPE) suffers from uncontrolled growth of low-frequency tokens as vocabulary size increases. We propose a multilayer hierarchical BPE framework that explicitly regulates vocabulary growth by treating tokens learned at one layer as atomic symbols at the next, biasing merges toward frequent patterns while suppressing rare tokens. As a proof of concept, we apply this approach to DNA sequence modeling under the DNABERT2 framework. Hierarchical tokenization across four layers reduces corpus size, measured in characters, by nearly a factor of four while preserving total token count and average tokens per sequence. The method outperforms the DNABERT2 baseline on four of seven GUE tasks and maintains a near-Zipfian token frequency distribution, though gains on short-sequence tasks are limited. These results demonstrate hierarchical BPE as a principled alternative to vocabulary scaling.
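The core idea of the abstract, tokens learned at one BPE layer becoming atomic symbols at the next, can be illustrated with a toy sketch. This is not the authors' implementation: the greedy pair selection, the minimum-frequency cutoff of 2 (a stand-in for the paper's suppression of rare tokens), and the small per-layer merge budget are all illustrative assumptions.

```python
from collections import Counter

def bpe_layer(corpus, num_merges, min_freq=2):
    """One BPE layer: greedily merge the most frequent adjacent
    token pair, skipping pairs rarer than min_freq. The input
    tokens (whatever a previous layer produced) are treated as
    indivisible atoms; only adjacency within a sequence matters."""
    corpus = [list(seq) for seq in corpus]
    for _ in range(num_merges):
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < min_freq:  # suppress rare merges
            break
        merged, new_corpus = a + b, []
        for seq in corpus:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return corpus

# Hierarchical application: each layer re-tokenizes the previous
# layer's output, so layer-k tokens are atoms at layer k+1.
dna = ["ACGTACGT", "ACGTTTAC", "ACGTACAC"]  # toy DNA corpus
layered = [list(s) for s in dna]
for _ in range(4):  # four layers, as in the abstract
    layered = bpe_layer(layered, num_merges=3)
```

After the four layers, each sequence is represented by fewer, longer tokens than the character-level input, which is the mechanism behind the reported reduction in corpus size.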
Paper Type: Short
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: DNA sequences as a language
Submission Number: 6851