Keywords: Tokenization, BPE, LLM, Corpus, Hierarchy
Abstract: Tokenization is a critical component of large language models, yet standard Byte Pair Encoding (BPE) suffers from uncontrolled growth of low-frequency tokens as vocabulary size increases. We propose a multilayer hierarchical BPE framework that explicitly regulates vocabulary growth by treating tokens learned at one layer as atomic symbols at the next, biasing merges toward frequent patterns while suppressing rare tokens. As a proof of concept, we apply this approach to DNA sequence modeling under the DNABERT2 framework. Hierarchical tokenization across four layers reduces corpus size, measured in characters, by nearly a factor of four while preserving total token count and average tokens per sequence. The method outperforms the DNABERT2 baseline on four of seven GUE tasks and maintains a near-Zipfian token frequency distribution, though gains on short-sequence tasks are limited. These results demonstrate hierarchical BPE as a principled alternative to vocabulary scaling.
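The core idea of the abstract, tokens learned at one BPE layer becoming atomic symbols at the next, can be illustrated with a toy sketch. This is not the authors' implementation: the greedy pair selection, the minimum-frequency cutoff of 2 (a stand-in for the paper's suppression of rare tokens), and the small per-layer merge budget are all illustrative assumptions.

```python
from collections import Counter

def bpe_layer(corpus, num_merges, min_freq=2):
    """One BPE layer: greedily merge the most frequent adjacent
    token pair, skipping pairs rarer than min_freq. The input
    tokens (whatever a previous layer produced) are treated as
    indivisible atoms; only adjacency within a sequence matters."""
    corpus = [list(seq) for seq in corpus]
    for _ in range(num_merges):
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < min_freq:  # suppress rare merges
            break
        merged, new_corpus = a + b, []
        for seq in corpus:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return corpus

# Hierarchical application: each layer re-tokenizes the previous
# layer's output, so layer-k tokens are atoms at layer k+1.
dna = ["ACGTACGT", "ACGTTTAC", "ACGTACAC"]  # toy DNA corpus
layered = [list(s) for s in dna]
for _ in range(4):  # four layers, as in the abstract
    layered = bpe_layer(layered, num_merges=3)
```

After the four layers, each sequence is represented by fewer, longer tokens than the character-level input, which is the mechanism behind the reported reduction in corpus size.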
Paper Type: Short
Research Area: LLM Efficiency
Research Area Keywords: Efficient/Low-Resource Methods for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: DNA sequences as a language
Submission Number: 6851