Keywords: context comprehension, masked diffusion language models, locality biases, SFT
Abstract: Masked Diffusion Language Models (MDLMs) have emerged as an alternative to autoregressive language models, with a denoising objective that in principle enables more uniform context utilisation. We study the context comprehension of MDLMs and identify two key limitations. First, despite a more global training objective, MDLMs exhibit a **strong locality bias**: performance depends heavily on the proximity of relevant information to the prediction target. Second, we show that appending **mask tokens—required for generation—can substantially degrade context comprehension**. Through systematic ablations, we find that these masks act as distractors, impairing the model’s ability to process relevant context. To mitigate this effect, we propose a **mask-agnostic loss** that enforces prediction invariance to the number of appended masks. Fine-tuning with this objective significantly improves robustness. Overall, our results reveal important shortcomings of current MDLMs and suggest concrete directions for improving context comprehension.
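The abstract does not specify the exact form of the proposed mask-agnostic loss, but one natural reading is a consistency penalty that compares the model's predictions at the shared context positions when the same prompt is padded with different numbers of appended mask tokens. The sketch below is an illustrative assumption, not the paper's implementation: the function name `mask_agnostic_penalty` and the choice of a KL-divergence penalty are hypothetical, standing in for whatever invariance term the authors use.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mask_agnostic_penalty(logits_few_masks, logits_many_masks):
    """Hypothetical consistency term (an assumption, not the paper's loss):
    KL divergence between the model's predictive distributions at the shared
    positions when the prompt is padded with k1 vs. k2 > k1 mask tokens.
    Both inputs have shape (num_shared_positions, vocab_size)."""
    p = softmax(logits_few_masks)
    q = softmax(logits_many_masks)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)  # per-position KL(p || q)
    return float(kl.mean())

# Toy demo: identical predictions incur zero penalty; a prediction shift
# caused by extra appended masks yields a positive penalty to minimize.
base = np.array([[1.0, 2.0, 0.5], [0.2, 0.1, 3.0]])
shifted = base + np.array([[0.5, -0.3, 0.0], [0.0, 0.4, -0.2]])
zero_pen = mask_agnostic_penalty(base, base)
pos_pen = mask_agnostic_penalty(base, shifted)
```

Fine-tuning with such a term added to the standard denoising objective would push the model toward producing the same context predictions regardless of how many masks are appended, which is the invariance the abstract describes.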
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 19