Keywords: Unified Multimodal Models, Vision Language Models, Generative Models
TL;DR: A self-rewarding post-training method that improves the generative ability of unified multimodal models.
Abstract: Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate generation and understanding capabilities within a single framework. However, a key challenge remains: a model's powerful understanding often fails to transfer to complex image generation. This typically occurs because the understanding and generation modules are trained separately or come into internal conflict during co-training. As a result, a model can accurately assess a prompt against an image yet fail to generate a correct image from that same prompt. To resolve this challenge, we introduce SRUM, a self-rewarding post-training framework designed to align a model's generation with its understanding module. Without any new human-labeled data, SRUM creates a self-improvement loop in which the model's own understanding module acts as an internal ``evaluator'', providing corrective reward signals to its generation module. Our core innovation is a two-part reward system that offers comprehensive guidance: a \textbf{global reward} for overall compositional structure and a \textbf{local reward} for fine-grained, object-level fidelity. This multi-scale feedback proves critical for complex generation. SRUM sets a new state of the art with strong generalization, boosting image-accuracy performance on T2I-CompBench from 82.18 to \textbf{88.37} and on T2I-ReasonBench from 40.7 to \textbf{50.4}. Overall, our work establishes a powerful new paradigm in which a UMM's understanding module guides its own generation.
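To make the abstract's self-rewarding loop concrete, here is a minimal, self-contained Python sketch of one post-training step: the understanding module scores the generated image globally and locally, and the combined reward updates the generator. Everything here (the stub model, the score functions, the equal 0.5/0.5 weighting, and the update rule) is a hypothetical illustration based only on the abstract, not the authors' implementation.

```python
import random

class StubUnifiedModel:
    """Stand-in for a unified multimodal model (UMM); all methods are stubs."""

    def generate(self, prompt: str) -> str:
        # Generation module (stubbed): would return an image.
        return f"<image for: {prompt}>"

    def score_global(self, prompt: str, image: str) -> float:
        # Understanding module as evaluator: overall compositional structure.
        return random.random()

    def score_local(self, prompt: str, image: str) -> float:
        # Understanding module as evaluator: fine-grained, object-level fidelity.
        return random.random()

    def update_generator(self, prompt: str, image: str, reward: float) -> None:
        # E.g. a policy-gradient step on the generation module (stubbed).
        pass

def srum_step(model, prompt, w_global=0.5, w_local=0.5):
    """One hypothetical SRUM step: generate, self-evaluate, reward, update."""
    image = model.generate(prompt)
    reward = (w_global * model.score_global(prompt, image)
              + w_local * model.score_local(prompt, image))
    model.update_generator(prompt, image, reward)
    return reward

print(srum_step(StubUnifiedModel(), "a red cube left of a blue sphere"))
```

Under this reading, no new human-labeled data enters the loop: the reward signal comes entirely from the model's own understanding module.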
Supplementary Material: zip
Primary Area: generative models
Submission Number: 3463