Reconciling Visual Perception and Generation in Diffusion Models

ICLR 2026 Conference Submission17469 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Visual Perception, Image Classification, Object Detection, Semantic Segmentation

TL;DR: We present GenRep, a unified image understanding and synthesis model that jointly conducts discriminative learning and generative modeling in one training session.

Abstract: We present GenRep, a unified image understanding and synthesis model that jointly performs discriminative learning and generative modeling in a single training session. Using Monte Carlo approximation, GenRep distills the distributional knowledge embedded in diffusion models to guide discriminative learning for visual perception tasks. Simultaneously, it establishes a semantics-driven image generation process in which high-level semantics learned from perception tasks inform image synthesis, creating a positive feedback loop in which the two tasks improve each other. Moreover, to reconcile the learning of both tasks, we propose a gradient alignment strategy that symmetrically modifies the optimization directions of the perception and generation losses. These designs make GenRep a versatile model that achieves leading performance on both image understanding and generation benchmarks. Code will be released after acceptance.

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 17469
