Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling

Published: 10 Jun 2025, Last Modified: 23 Jun 2025, LCFM 2025, CC BY 4.0
Keywords: Byte Language Models, Hierarchical Transformers, Long Context Window, MegaByte, MambaByte
TL;DR: A modality-agnostic and hierarchical byte language model that scales to training with a context window of 5M tokens
Abstract: Bytes form the basis of the digital world and are thus a promising building block for multimodal foundation models. Yet the excessive length of bytestreams requires new architectural paradigms for Byte Language Models (BLMs). We therefore present the Multiscale BLM (MBLM), a model-agnostic hierarchical decoder stack that enables training with context windows of 5M bytes on a single GPU in full precision. Our experiments demonstrate that hybrid Transformer/Mamba architectures handle extremely long byte sequences efficiently during training while achieving near-linear generational efficiency. The source code is publicly available at https://github.com/ai4sd/multiscale-byte-lm, and MBLM can be installed from PyPI.
Submission Number: 31
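For orientation, below is a minimal, self-contained sketch of the kind of two-level hierarchy the abstract describes: a global model operating on patch embeddings supplies long-range context to a local model that predicts bytes within each patch. All class, parameter, and variable names here are illustrative assumptions, not the MBLM package's API (for that, see the GitHub repository), and in a hybrid stack the stages could be Mamba blocks rather than Transformer layers.

```python
# Illustrative two-stage hierarchical (multiscale) byte decoder.
# Names and hyperparameters are hypothetical; this is NOT the MBLM API.
import torch
import torch.nn as nn


class TwoStageByteDecoder(nn.Module):
    """Global stage over patch embeddings, local stage over bytes in a patch."""

    def __init__(self, d_global=512, d_local=256, patch_size=8, vocab=256):
        super().__init__()
        self.patch_size = patch_size
        self.byte_embed = nn.Embedding(vocab, d_local)
        # Patch embedding: concatenate byte embeddings within a patch, project up.
        self.patch_proj = nn.Linear(d_local * patch_size, d_global)
        # Causally masked encoder layers act as decoder-only stacks here.
        layer = lambda d: nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.global_stage = nn.TransformerEncoder(layer(d_global), num_layers=4)
        self.local_stage = nn.TransformerEncoder(layer(d_local), num_layers=2)
        self.global_to_local = nn.Linear(d_global, d_local)
        self.head = nn.Linear(d_local, vocab)
        # Learned "start" embeddings used to shift inputs right by one step.
        self.start_patch = nn.Parameter(torch.zeros(1, 1, d_global))
        self.start_byte = nn.Parameter(torch.zeros(1, 1, 1, d_local))

    def forward(self, byte_ids):
        """byte_ids: (batch, seq_len), seq_len divisible by patch_size.
        Returns next-byte logits of shape (batch, seq_len, vocab)."""
        b, n = byte_ids.shape
        p = self.patch_size
        k = n // p
        x = self.byte_embed(byte_ids)                       # (b, n, d_local)

        # Global stage: causal over patches; shift right so patch i only
        # conditions on patches < i.
        patches = x.view(b, k, p * x.size(-1))
        start = self.start_patch.expand(b, 1, -1)
        g_in = torch.cat([start, self.patch_proj(patches[:, :-1])], dim=1)
        mask_g = nn.Transformer.generate_square_subsequent_mask(k).to(x.device)
        g_out = self.global_stage(g_in, mask=mask_g)        # (b, k, d_global)

        # Local stage: causal over bytes inside each patch; shift right so
        # byte j only conditions on bytes < j plus the patch-level context.
        local_bytes = x.view(b, k, p, -1)
        start_b = self.start_byte.expand(b, k, 1, -1)
        l_in = torch.cat([start_b, local_bytes[:, :, :-1]], dim=2)
        l_in = l_in + self.global_to_local(g_out).unsqueeze(2)
        mask_l = nn.Transformer.generate_square_subsequent_mask(p).to(x.device)
        h = self.local_stage(l_in.reshape(b * k, p, -1), mask=mask_l)
        return self.head(h).view(b, n, -1)                  # next-byte logits


if __name__ == "__main__":
    model = TwoStageByteDecoder()
    logits = model(torch.randint(0, 256, (2, 64)))  # -> (2, 64, 256)
    print(logits.shape)
```

The design choice illustrated is the one that makes million-length byte contexts tractable: self-attention cost at the global level scales with the number of patches rather than the number of bytes, and the local stage only ever attends within a short patch.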