Keywords: context comprehension, masked diffusion language models, locality biases, SFT
Abstract: Masked Diffusion Language Models (MDLMs) have emerged as an alternative to autoregressive language models, with a denoising objective that in principle enables more uniform context utilisation. We study the context comprehension of MDLMs and identify two key limitations. First, despite a more global training objective, MDLMs exhibit a **strong locality bias**: performance depends heavily on the proximity of relevant information to the prediction target. Second, we show that appending **mask tokens—required for generation—can substantially degrade context comprehension**. Through systematic ablations, we find that these masks act as distractors, impairing the model’s ability to process relevant context. To mitigate this effect, we propose a **mask-agnostic loss** that enforces prediction invariance to the number of appended masks. Fine-tuning with this objective significantly improves robustness. Overall, our results reveal important shortcomings of current MDLMs and suggest concrete directions for improving context comprehension.
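The abstract does not specify the exact form of the proposed mask-agnostic loss, but one natural reading is a consistency penalty that compares the model's predictions at the shared context positions when the same prompt is padded with different numbers of appended mask tokens. The sketch below is an illustrative assumption, not the paper's implementation: the function name `mask_agnostic_penalty` and the choice of a KL-divergence penalty are hypothetical, standing in for whatever invariance term the authors use.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mask_agnostic_penalty(logits_few_masks, logits_many_masks):
    """Hypothetical consistency term (an assumption, not the paper's loss):
    KL divergence between the model's predictive distributions at the shared
    positions when the prompt is padded with k1 vs. k2 > k1 mask tokens.
    Both inputs have shape (num_shared_positions, vocab_size)."""
    p = softmax(logits_few_masks)
    q = softmax(logits_many_masks)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)  # per-position KL(p || q)
    return float(kl.mean())

# Toy demo: identical predictions incur zero penalty; a prediction shift
# caused by extra appended masks yields a positive penalty to minimize.
base = np.array([[1.0, 2.0, 0.5], [0.2, 0.1, 3.0]])
shifted = base + np.array([[0.5, -0.3, 0.0], [0.0, 0.4, -0.2]])
zero_pen = mask_agnostic_penalty(base, base)
pos_pen = mask_agnostic_penalty(base, shifted)
```

Fine-tuning with such a term added to the standard denoising objective would push the model toward producing the same context predictions regardless of how many masks are appended, which is the invariance the abstract describes.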
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 19