Track: long paper (up to 8 pages)
Keywords: byte modeling, discrete diffusion, tokenizer-free
TL;DR: We find that masked diffusion models exhibit worse scaling behavior than autoregressive models when trained on raw bytes.
Abstract: Language models have historically relied on two dominant design choices: subword tokenization and autoregressive (AR) ordering. Recently, there has been significant research interest in byte-level modeling, which bypasses domain-specific vocabularies, and in masked diffusion models (MDMs), which enable parallel, non-sequential generation.
Intuitively, the intersection of these paradigms represents a generative ideal: a modality-agnostic system capable of fine-grained any-order generation. However, the computational interaction between these granular representations and non-sequential objectives remains underexplored. In this work, we investigate the viability of this combination through a compute-matched scaling study. We observe a structural dichotomy: AR models on bytes effectively amortize the cost of tokenization, naturally rediscovering subword segmentation at scale.
In contrast, byte-level MDMs demand disproportionately more compute to match their BPE counterparts at the compute scales studied, and our isoFLOP analyses suggest that they may reach parity only at much higher compute scales.
We attribute this disparity to the masking objective, which shatters the local contiguity required to resolve subword semantics from bytes, whereas AR's stable causal history preserves these local dependencies. Our findings highlight a critical efficiency tradeoff, suggesting that future modality-agnostic designs should address this context fragility to maintain efficient scaling trajectories.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 37