Dual-Path Condition Alignment for Diffusion Transformers

ICLR 2026 Conference Submission19872 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Diffusion Transformer, Self-Supervised Learning, Representation Learning.
Abstract: Denoising-based generative models have been significantly advanced by the representation-alignment (REPA) loss, which leverages pre-trained visual encoders to guide intermediate network features. However, REPA's reliance on external visual encoders introduces two critical challenges: potential \textit{distribution mismatches} between the encoder's training data and the generation target, and the high \textit{computational cost} of pre-training. Inspired by the observation that REPA primarily helps early layers capture robust semantics, we propose an unsupervised alternative that avoids external visual encoders and the assumption of a consistent data distribution. We introduce \textit{\textbf{DU}al-\textbf{P}ath condition \textbf{A}lignment} (\textbf{DUPA}), a novel self-alignment framework that independently noises an image multiple times, processes the resulting noisy latents through a decoupled diffusion transformer, and then aligns the derived conditions\textemdash low-frequency semantic features extracted from each path. Experiments demonstrate that DUPA achieves FID$=$1.46 on ImageNet 256$\times$256 with only 400 training epochs, outperforming all methods that do not rely on external supervision. Critically, DUPA accelerates training of its base model by 5$\times$ and inference by 10$\times$. DUPA is also model-agnostic and can be readily applied to any denoising-based generative model, showcasing its excellent scalability and generalizability.
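The abstract's core mechanism (independently noising the same image twice and aligning low-frequency condition features from the two paths) can be illustrated with a minimal sketch. This is not the authors' implementation: the noising schedule, the FFT-based low-pass "condition extraction", and the cosine-similarity alignment loss are all illustrative stand-ins, and the diffusion transformer itself is replaced by the identity for brevity.

```python
import numpy as np

def add_noise(x, t, rng):
    # Independently noise the clean latent x at timestep t
    # (illustrative variance-preserving interpolation, not DUPA's actual schedule).
    eps = rng.standard_normal(x.shape)
    return np.sqrt(1.0 - t) * x + np.sqrt(t) * eps

def low_freq_condition(h, keep=4):
    # Hypothetical condition extraction: keep only the lowest-frequency
    # FFT components of a feature vector as its "low-frequency semantics".
    H = np.fft.fft(h)
    H[keep:-keep] = 0.0
    return np.real(np.fft.ifft(H))

def dupa_alignment_loss(x, t, rng):
    # Two independent noisings of the same image -> two paths.
    z1 = add_noise(x, t, rng)
    z2 = add_noise(x, t, rng)
    # In the real framework each path would run through a decoupled
    # diffusion transformer; here the features are the latents themselves.
    c1 = low_freq_condition(z1)
    c2 = low_freq_condition(z2)
    # Align the two paths' conditions via negative cosine similarity.
    cos = np.dot(c1, c2) / (np.linalg.norm(c1) * np.linalg.norm(c2) + 1e-8)
    return 1.0 - cos
```

Because both paths share the same clean signal, their low-frequency conditions agree up to noise, so minimizing this loss encourages the (here, omitted) network to produce noise-invariant semantic features without any external encoder.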
Supplementary Material: zip
Primary Area: generative models
Submission Number: 19872