A Theoretical Analysis of Why Masked Diffusion Models Mitigate the Reversal Curse

Published: 26 May 2026, Last Modified: 01 Jun 2026, Venue: ICML 2026 FoGen Workshop Poster, License: CC BY 4.0
Keywords: Generative modeling, Masked diffusion language models, Reversal curse, Transformer training dynamics, Mechanistic theory, Relative positional encoding
TL;DR: We theoretically explain why masked diffusion models mitigate the reversal curse, showing that the key is not any-order masking alone but position-invariant relational storage and attention-based routing to reversed queries.
Abstract: Autoregressive language models (ARMs) suffer from the reversal curse: after learning "$A$ is $B$," they often fail on the reverse query "$B$ is $A$." Masked diffusion language models (MDMs) exhibit this failure in a much weaker form, but the underlying reason has remained unclear. A common explanation attributes this mitigation to their any-order masked training objective. However, observing "$[\textnormal{\textbf{M}}]$ is $B$" during training teaches recovery of $A$ from $B$ in one positional configuration, and does not by itself explain why the learned evidence should transfer to the reverse prompt "$B$ is $[\textnormal{\textbf{M}}]$." We provide a theoretical analysis showing that this transfer arises from a parameter-level coupling between forward and reverse positional conditionals: shared Transformer parameters store token-pair evidence, while relative positional encodings route attention through queries and keys without changing the value-side evidence being retrieved. In a one-layer MDM, we prove that forward masked training strengthens evidence that is reusable in reverse queries, induces correlated forward--reverse attention routes, and yields a positively aligned shared-storage gradient component that decreases the reverse loss to first order. Controlled one-layer experiments and large-scale LLaDA/Dream experiments verify these signatures and show that they translate into improved reverse prediction.
Submission Number: 133
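
The transfer argument in the abstract can be illustrated with a toy computation. The sketch below is not the paper's construction: the names `predict_logit`, `shared`, `tied`, and `rel_bias`, the three-token sequences, and all numeric values are hypothetical, and the symmetric relative-position bias is assumed rather than learned. It only shows why evidence stored per token pair (independent of position) is retrievable from the reverse prompt, whereas the same evidence stored together with an absolute position is not.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def predict_logit(tokens, mask_pos, target, evidence, rel_bias, position_tied=False):
    """Logit for `target` at the masked position of a toy one-layer model.

    Attention scores depend only on the relative offset (j - mask_pos),
    so routing goes through the query/key side. The retrieved value is a
    lookup in `evidence`, keyed by (context token, target) when storage is
    position-invariant, or by (context token, absolute position, target)
    when storage is tied to position. Both schemes are illustrative assumptions.
    """
    ctx = [j for j in range(len(tokens)) if j != mask_pos]
    attn = softmax([rel_bias.get(j - mask_pos, 0.0) for j in ctx])
    total = 0.0
    for w, j in zip(attn, ctx):
        if position_tied:
            key = (tokens[j], j, target)
        else:
            key = (tokens[j], target)
        total += w * evidence.get(key, 0.0)
    return total

forward = ["[M]", "is", "B"]   # training configuration: recover A from B
reverse = ["B", "is", "[M]"]   # reversed query

# Assumed routing: the masked slot prefers the entity two positions away,
# in either direction (hypothetical values).
rel_bias = {-2: 1.0, 2: 1.0}

# Evidence strengthened by forward masked training, stored two ways.
shared = {("B", "A"): 1.0}     # position-invariant token-pair evidence
tied = {("B", 2, "A"): 1.0}    # same evidence, tied to B at absolute position 2

fwd_shared = predict_logit(forward, 0, "A", shared, rel_bias)
rev_shared = predict_logit(reverse, 2, "A", shared, rel_bias)
rev_tied = predict_logit(reverse, 2, "A", tied, rel_bias, position_tied=True)

print("forward, shared storage:", fwd_shared)
print("reverse, shared storage:", rev_shared)
print("reverse, position-tied storage:", rev_tied)
```

With shared storage the reverse query retrieves the same evidence as the forward one, because only the attention route changes. With position-tied storage the reverse logit stays at zero, which mirrors the abstract's point that any-order masking alone does not explain the transfer.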