Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling

Published: 10 Jun 2025, Last Modified: 23 Jun 2025, LCFM 2025, CC BY 4.0
Keywords: Byte Language Models, Hierarchical Transformers, Long Context Window, MegaByte, MambaByte
TL;DR: A modality-agnostic and hierarchical byte language model that scales to training with a context window of 5M tokens
Abstract: Bytes form the basis of the digital world and are thus a promising building block for multimodal foundation models. Yet the excessive length of bytestreams requires new architectural paradigms for Byte Language Models (BLMs). We therefore present the Multiscale BLM (MBLM), a model-agnostic hierarchical decoder stack that enables training with context windows of 5M bytes on a single GPU in full precision. Our experiments demonstrate that hybrid Transformer/Mamba architectures handle extremely long byte sequences efficiently during training while achieving near-linear generational efficiency. The source code is publicly available at https://github.com/ai4sd/multiscale-byte-lm, and MBLM can be installed from PyPI.
Submission Number: 31
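For orientation, below is a minimal, self-contained sketch of the kind of two-level hierarchy the abstract describes: a global model operating on patch embeddings supplies long-range context to a local model that predicts bytes within each patch. All class, parameter, and variable names here are illustrative assumptions, not the MBLM package's API (for that, see the GitHub repository), and in a hybrid stack the stages could be Mamba blocks rather than Transformer layers.

```python
# Illustrative two-stage hierarchical (multiscale) byte decoder.
# Names and hyperparameters are hypothetical; this is NOT the MBLM API.
import torch
import torch.nn as nn


class TwoStageByteDecoder(nn.Module):
    """Global stage over patch embeddings, local stage over bytes in a patch."""

    def __init__(self, d_global=512, d_local=256, patch_size=8, vocab=256):
        super().__init__()
        self.patch_size = patch_size
        self.byte_embed = nn.Embedding(vocab, d_local)
        # Patch embedding: concatenate byte embeddings within a patch, project up.
        self.patch_proj = nn.Linear(d_local * patch_size, d_global)
        # Causally masked encoder layers act as decoder-only stacks here.
        layer = lambda d: nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.global_stage = nn.TransformerEncoder(layer(d_global), num_layers=4)
        self.local_stage = nn.TransformerEncoder(layer(d_local), num_layers=2)
        self.global_to_local = nn.Linear(d_global, d_local)
        self.head = nn.Linear(d_local, vocab)
        # Learned "start" embeddings used to shift inputs right by one step.
        self.start_patch = nn.Parameter(torch.zeros(1, 1, d_global))
        self.start_byte = nn.Parameter(torch.zeros(1, 1, 1, d_local))

    def forward(self, byte_ids):
        """byte_ids: (batch, seq_len), seq_len divisible by patch_size.
        Returns next-byte logits of shape (batch, seq_len, vocab)."""
        b, n = byte_ids.shape
        p = self.patch_size
        k = n // p
        x = self.byte_embed(byte_ids)                       # (b, n, d_local)

        # Global stage: causal over patches; shift right so patch i only
        # conditions on patches < i.
        patches = x.view(b, k, p * x.size(-1))
        start = self.start_patch.expand(b, 1, -1)
        g_in = torch.cat([start, self.patch_proj(patches[:, :-1])], dim=1)
        mask_g = nn.Transformer.generate_square_subsequent_mask(k).to(x.device)
        g_out = self.global_stage(g_in, mask=mask_g)        # (b, k, d_global)

        # Local stage: causal over bytes inside each patch; shift right so
        # byte j only conditions on bytes < j plus the patch-level context.
        local_bytes = x.view(b, k, p, -1)
        start_b = self.start_byte.expand(b, k, 1, -1)
        l_in = torch.cat([start_b, local_bytes[:, :, :-1]], dim=2)
        l_in = l_in + self.global_to_local(g_out).unsqueeze(2)
        mask_l = nn.Transformer.generate_square_subsequent_mask(p).to(x.device)
        h = self.local_stage(l_in.reshape(b * k, p, -1), mask=mask_l)
        return self.head(h).view(b, n, -1)                  # next-byte logits


if __name__ == "__main__":
    model = TwoStageByteDecoder()
    logits = model(torch.randint(0, 256, (2, 64)))  # -> (2, 64, 256)
    print(logits.shape)
```

The design choice illustrated is the one that makes million-length byte contexts tractable: self-attention cost at the global level scales with the number of patches rather than the number of bytes, and the local stage only ever attends within a short patch.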