Keywords: Visual Perception, Image Classification, Object Detection, Semantic Segmentation
TL;DR: We present GenRep, a unified image understanding and synthesis model that jointly conducts discriminative learning and generative modeling in one training session.
Abstract: We present \textsc{GenRep}, a unified image understanding and synthesis model that jointly conducts discriminative learning and generative modeling in one training session. By leveraging Monte Carlo approximation, \textsc{GenRep} distills distributional knowledge embedded in diffusion models to guide the discriminative learning for visual perception tasks. Simultaneously, a semantic-driven image generation process is established, where high-level semantics learned from perception tasks can be used to inform image synthesis, creating a positive feedback loop for mutual boosts. Moreover, to reconcile the learning process for both tasks, a gradient alignment strategy is proposed to symmetrically modify the optimization directions of perception and generation losses. These designs empower \textsc{GenRep} to be a versatile and powerful model that achieves top-leading performance on both image understanding and generation benchmarks. Code will be released after acceptance.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17469
Loading