Keywords: Applications of interpretability, Feature Geometry
Other Keywords: Masked Diffusion Model, Addition
TL;DR: MDMs generalize better than AR models on high-carry addition because they learn more linearly aligned carry representations, not because they use a different layer-level algorithm.
Abstract: Masked diffusion models (MDMs) have emerged as alternatives to autoregressive (AR) language models, with evidence of stronger generalization under data constraints. We study this gap mechanistically in a controlled addition task: one-layer Transformers add two six-digit numbers after training only on examples with limited carry complexity, then are evaluated on an out-of-distribution carry-generalization split requiring $N_{\mathrm{carry}}>2$. This tests whether models extrapolate the carry rule beyond the carry counts observed during training. With C2, MDMs outperform AR models on high-carry examples, while C2-Resampled largely closes the gap. We trace this difference to the geometry of carry representations. Attention and MLP sublayers play similar roles in both model classes: attention aggregates base-addition information, while the MLP makes answer tokens more linearly decodable. However, MDM training yields stronger linear alignment, which we define as the fraction of post-attention representation variance captured along the carry/non-carry direction. We then show theoretically, in a Gaussian model, that higher linear alignment improves robustness to boundary perturbations. Retraining the MLP while freezing earlier representations preserves the same accuracy ordering, suggesting that MDMs generalize better in this setting because they learn better-aligned carry representations, not a qualitatively different layer-level algorithm.
Submission Number: 597
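
The two quantities the abstract relies on, the carry count $N_{\mathrm{carry}}$ and the linear alignment of representations, can be illustrated with a minimal sketch. This is not the authors' code. The function names `count_carries` and `linear_alignment` are hypothetical, and the alignment computation assumes that the carry/non-carry direction is the normalized difference of class means, which the abstract does not specify.

```python
from typing import List, Sequence


def count_carries(a: int, b: int) -> int:
    """Count the carries produced by schoolbook addition of a and b (base 10)."""
    carry, n_carry = 0, 0
    while a > 0 or b > 0:
        # Carry out of the current digit position is 0 or 1.
        carry = (a % 10 + b % 10 + carry) // 10
        n_carry += carry
        a //= 10
        b //= 10
    return n_carry


def linear_alignment(reps: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of total variance lying along the carry/non-carry direction.

    ASSUMPTION: the direction is the unit vector between the class means
    (label 1 = carry, label 0 = no carry). The paper may define it differently.
    """
    dim = len(reps[0])

    def mean(rows: List[Sequence[float]]) -> List[float]:
        return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]

    mu0 = mean([r for r, y in zip(reps, labels) if y == 0])
    mu1 = mean([r for r, y in zip(reps, labels) if y == 1])
    diff = [p - q for p, q in zip(mu1, mu0)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("class means coincide; direction is undefined")
    direction = [d / norm for d in diff]

    center = mean(list(reps))
    total, along = 0.0, 0.0
    for r in reps:
        centered = [x - c for x, c in zip(r, center)]
        total += sum(x * x for x in centered)
        proj = sum(x * d for x, d in zip(centered, direction))
        along += proj * proj
    if total == 0.0:
        raise ValueError("representations have zero variance")
    return along / total
```

Under these assumptions, an alignment of 1 means all variance lies along the carry/non-carry axis, while lower values mean the representations also vary in directions unrelated to the carry decision. An evaluation example belongs to the out-of-distribution split described in the abstract when `count_carries(a, b) > 2`.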