Keywords: Unified Multimodal Models, Vision Language Models, Generative Models
TL;DR: A self-rewarding post-training method that improves the generative ability of unified multimodal models.
Abstract: Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate generation and understanding capabilities within a single framework. However, a key challenge remains: a model's powerful understanding often fails to transfer to complex image generation. This typically occurs because the understanding and generation modules are trained separately or come into internal conflict during co-training. As a result, a model can accurately assess a prompt against an image yet fail to generate a correct image from that same prompt. To resolve this challenge, we introduce SRUM, a self-rewarding post-training framework designed to align a model's generation with its understanding module. Without any new human-labeled data, SRUM creates a self-improvement loop in which the model's own understanding module acts as an internal ``evaluator'', providing corrective reward signals to its generation module. Our core innovation is a two-part reward system that offers comprehensive guidance: a \textbf{global reward} for overall compositional structure and a \textbf{local reward} for fine-grained, object-level fidelity. This multi-scale feedback proves critical for complex generation. SRUM sets a new state of the art with strong generalization, boosting image-accuracy performance on T2I-CompBench from 82.18 to \textbf{88.37} and on T2I-ReasonBench from 40.7 to \textbf{50.4}. Overall, our work establishes a powerful new paradigm in which a UMM's understanding module guides its own generation.
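To make the abstract's self-rewarding loop concrete, here is a minimal, self-contained Python sketch of one post-training step: the understanding module scores the generated image globally and locally, and the combined reward updates the generator. Everything here (the stub model, the score functions, the equal 0.5/0.5 weighting, and the update rule) is a hypothetical illustration based only on the abstract, not the authors' implementation.

```python
import random

class StubUnifiedModel:
    """Stand-in for a unified multimodal model (UMM); all methods are stubs."""

    def generate(self, prompt: str) -> str:
        # Generation module (stubbed): would return an image.
        return f"<image for: {prompt}>"

    def score_global(self, prompt: str, image: str) -> float:
        # Understanding module as evaluator: overall compositional structure.
        return random.random()

    def score_local(self, prompt: str, image: str) -> float:
        # Understanding module as evaluator: fine-grained, object-level fidelity.
        return random.random()

    def update_generator(self, prompt: str, image: str, reward: float) -> None:
        # E.g. a policy-gradient step on the generation module (stubbed).
        pass

def srum_step(model, prompt, w_global=0.5, w_local=0.5):
    """One hypothetical SRUM step: generate, self-evaluate, reward, update."""
    image = model.generate(prompt)
    reward = (w_global * model.score_global(prompt, image)
              + w_local * model.score_local(prompt, image))
    model.update_generator(prompt, image, reward)
    return reward

print(srum_step(StubUnifiedModel(), "a red cube left of a blue sphere"))
```

Under this reading, no new human-labeled data enters the loop: the reward signal comes entirely from the model's own understanding module.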
Supplementary Material: zip
Primary Area: generative models
Submission Number: 3463