The Efficiency Gap in Byte Modeling

Published: 02 Mar 2026, Last Modified: 02 Mar 2026
Venue: ICLR 2026 Workshop MM Intelligence (Poster)
License: CC BY 4.0
Track: long paper (up to 8 pages)
Keywords: byte modeling, discrete diffusion, tokenizer-free
TL;DR: We find that masked diffusion models exhibit worse scaling behavior than autoregressive models when trained on raw bytes.
Abstract: Modern language models typically rely on two design choices: subword tokenization and autoregressive (AR) ordering. In pursuit of more universal modeling, the field is advancing toward byte-level modeling, which bypasses domain-specific vocabularies, and masked diffusion models (MDMs), which enable parallel, non-sequential generation. Intuitively, the intersection of these paradigms represents a generative ideal: a modality-agnostic system capable of fine-grained any-order generation. However, the computational interaction between these granular representations and non-sequential objectives remains under-explored. In this work, we investigate the viability of this combination through a compute-matched scaling study. We observe a structural dichotomy: AR models on bytes effectively amortize the cost of tokenization, naturally rediscovering sub-word segmentation at scale. In contrast, byte-level MDMs suffer a non-convergent efficiency collapse. We attribute this disparity to the masking objective, which shatters the local contiguity required to resolve sub-word semantics from bytes, whereas AR's stable causal history preserves these essential local dependencies. Our findings reveal a critical efficiency tradeoff for the community, suggesting that future modality-agnostic designs should address this context fragility to maintain efficient scaling.
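To make the contrast in training objectives concrete, the following is a minimal pure-Python sketch of how AR and masked-diffusion training examples differ at the byte level. The `MASK` token id, the `mask_rate` parameter, and the function names are illustrative assumptions, not details from the paper: the AR objective conditions each byte on its intact causal prefix, while the masking objective replaces a random subset of positions, fragmenting the local byte context that encodes sub-word structure.

```python
import random

# Hypothetical mask id placed outside the 0-255 byte range.
MASK = 256

def ar_inputs_targets(byte_seq):
    # AR objective: predict each byte from its full causal prefix,
    # so contiguous local byte order is always preserved.
    return [(byte_seq[:i], byte_seq[i]) for i in range(1, len(byte_seq))]

def mdm_inputs_targets(byte_seq, mask_rate, rng):
    # Masked-diffusion-style objective: mask a random subset of
    # positions and predict them from the remaining, fragmented context.
    masked, targets = [], {}
    for i, b in enumerate(byte_seq):
        if rng.random() < mask_rate:
            masked.append(MASK)
            targets[i] = b  # position -> original byte to recover
        else:
            masked.append(b)
    return masked, targets

text = list(b"tokenization")
pairs = ar_inputs_targets(text)          # one prediction per prefix
masked, targets = mdm_inputs_targets(text, 0.5, random.Random(0))
```

In the AR case every training pair sees an unbroken run of preceding bytes; in the MDM case, bytes inside a multi-byte sub-word can be masked out of the context used to predict their neighbors, which is the contiguity-shattering effect the abstract attributes the efficiency collapse to.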
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 37