UniGP: Taming Diffusion Transformer for Prior Preserved Unified Generation and Perception

17 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Diffusion model; Controllable generation; Diffusion-based perception
Abstract: Recent advances in diffusion models have shown impressive performance in controllable image generation and dense prediction tasks (e.g., depth and normal estimation). However, existing approaches typically treat diffusion-based controllable generation and dense prediction as separate tasks, overlooking the potential benefits of jointly modeling their different distributions. In this work, we introduce UniGP, a framework built upon MMDiT that unifies controllable generation and dense prediction through simple joint training, without complex task-specific designs or losses, while preserving the backbone's versatile priors. By learning controllable generation and prediction under different conditions, our model effectively captures the joint distribution of image-geometry pairs. UniGP is capable of versatile controllable generation (like ControlNet), dense prediction (like Marigold), and joint generation (like JointNet). Specifically, UniGP consists of DUGP and a unified dataset training strategy. The former, following the principle of Occam's razor, uses only a copied image branch of MMDiT to model dense distributions beyond RGB, while the latter integrates different types of datasets into a unified training framework to jointly model generation and perception tasks. Extensive experiments demonstrate that our unified model surpasses prior unified approaches and is comparable with specialized methods. Furthermore, we show that through multi-task joint training, controllable generation and dense prediction can mutually enhance each other.
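Based only on the abstract's description of DUGP ("a copied image branch of MMDiT to model dense distributions beyond RGB"), the sketch below illustrates one way such a dual-branch MMDiT block could be wired: text, image, and geometry tokens each keep their own weights but attend jointly, and the geometry branch is initialized as a copy of the image branch so the pretrained RGB prior is preserved. All class and variable names (SimpleMMDiTBranch, DualBranchMMDiTBlock, geo, etc.) are hypothetical and not taken from the paper.

```python
# Minimal sketch, not the authors' implementation: a DUGP-style MMDiT block with a
# copied image branch for a second dense modality (e.g., depth or normal latents).
import copy
import torch
import torch.nn as nn


class SimpleMMDiTBranch(nn.Module):
    """One modality branch of an MMDiT block: pre-norm, QKV projection, MLP."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))


class DualBranchMMDiTBlock(nn.Module):
    """Text branch + image branch + a copied image branch for a geometry modality.

    All three token streams attend jointly, as in MMDiT, so image and geometry
    tokens exchange information while each modality keeps its own weights.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.text = SimpleMMDiTBranch(dim)
        self.image = SimpleMMDiTBranch(dim)
        # Hypothetical DUGP-style branch: initialized by copying the (pretrained)
        # image branch so the RGB prior is preserved at the start of joint training.
        self.geo = copy.deepcopy(self.image)
        self.num_heads = num_heads
        self.head_dim = dim // num_heads

    def _qkv(self, branch: SimpleMMDiTBranch, x: torch.Tensor):
        b, n, _ = x.shape
        qkv = branch.qkv(branch.norm1(x))
        qkv = qkv.view(b, n, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        return qkv[0], qkv[1], qkv[2]  # each: (b, heads, n, head_dim)

    def forward(self, txt: torch.Tensor, img: torch.Tensor, geo: torch.Tensor):
        # Per-branch projections, then joint attention over the concatenated tokens.
        qkvs = [self._qkv(br, x) for br, x in ((self.text, txt), (self.image, img), (self.geo, geo))]
        q = torch.cat([q for q, _, _ in qkvs], dim=2)
        k = torch.cat([k for _, k, _ in qkvs], dim=2)
        v = torch.cat([v for _, _, v in qkvs], dim=2)
        out = nn.functional.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(q.shape[0], -1, self.num_heads * self.head_dim)

        # Split back per modality and apply each branch's own output projection and MLP.
        splits = out.split([txt.shape[1], img.shape[1], geo.shape[1]], dim=1)
        outs = []
        for branch, x, o in zip((self.text, self.image, self.geo), (txt, img, geo), splits):
            x = x + branch.proj(o)
            x = x + branch.mlp(branch.norm2(x))
            outs.append(x)
        return tuple(outs)


if __name__ == "__main__":
    block = DualBranchMMDiTBlock()
    txt, img, geo = (torch.randn(2, n, 512) for n in (77, 256, 256))
    t, i, g = block(txt, img, geo)
    print(t.shape, i.shape, g.shape)
```

Under these assumptions, swapping which stream is noised versus given clean would switch the block between controllable generation, dense prediction, and joint generation, but that training logic is not shown here.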
Supplementary Material: pdf
Primary Area: generative models
Submission Number: 8722