DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models
Keywords: diffusion language models, joint embedding predictive architectures, representation learning, fine-tuning, self-supervised learning, training efficiency
TL;DR: We adapt JEPA-style objectives to diffusion language models, improving fine-tuning performance while reducing training cost and maintaining base-model behavior.
Abstract: We introduce DLLM-JEPA, a JEPA formulation for masked diffusion language models. JEPA objectives have so far been applied to autoregressive LMs at the cost of explicit paired views and multiple gradient passes per training step. By leveraging the diffusion noise schedule, DLLM-JEPA constructs two views from a single input without requiring paired data. Because it needs only a single gradient pass per step, it also reduces JEPA training FLOPs by 33% relative to LLM-JEPA’s two-gradient-view design.
Across four tasks and two diffusion backbones, DLLM-JEPA consistently improves over diffusion-only fine-tuning, with modest gains in stable settings (e.g., +1.8 pp on GSM8K) and larger improvements under more aggressive fine-tuning, while also tightening seed-to-seed variance on the high-variance LLaDA-8B GSM8K cells. In addition, it degrades base-model performance neither on a held-out diffusion-loss probe nor on a small MMLU sanity check.
We further analyze the representation dynamics induced by the objective and observe a consistent empirical pattern: models trained with DLLM-JEPA exhibit larger geometric drift from their pretrained initialization while maintaining comparable or lower functional forgetting.
These results suggest that DLLM-JEPA provides an efficient way to incorporate representation-level objectives into diffusion language model fine-tuning.
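The 33% figure quoted in the abstract can be reproduced with a back-of-the-envelope cost model. The sketch below is not taken from the paper: it assumes the common rule of thumb that a backward pass costs about twice a forward pass, and it assumes both designs run a forward pass on two views, differing only in how many views are backpropagated. The function `step_cost` and its parameters are hypothetical names used for illustration.

```python
def step_cost(forward_passes: int, backward_passes: int,
              backward_to_forward: float = 2.0) -> float:
    """Cost of one training step, in units of one forward pass.

    Assumption (not from the paper): a backward pass costs roughly
    `backward_to_forward` times as much as a forward pass.
    """
    return forward_passes + backward_to_forward * backward_passes


# Assumed two-gradient-view design: both views get a forward and a backward pass.
two_gradient_views = step_cost(forward_passes=2, backward_passes=2)

# Assumed single-gradient design: both views are encoded,
# but only one of them is backpropagated.
single_gradient_view = step_cost(forward_passes=2, backward_passes=1)

reduction = 1.0 - single_gradient_view / two_gradient_views
print(f"relative FLOP reduction: {reduction:.1%}")  # prints "relative FLOP reduction: 33.3%"
```

Under these assumptions the step cost falls from 6 to 4 forward-pass units, which matches the reported reduction. A different backward-to-forward ratio, or a cheaper encoding of the second view, would change the number.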
Submission Number: 30