Keywords: unified multimodal large language models, alignment
Abstract: Unified Multi-Modal Large Language Models (U-MLLMs) have demonstrated strong capabilities in text-to-image (T2I) generation, but most post-training methods still rely on sparse, image-level rewards and place limited emphasis on safety. In this work, we take an exploratory view of \emph{dense} reward signals for U-MLLMs: token-level feedback derived from existing reward and evaluation models. Rather than proposing a new RL algorithm, we study how dense rewards can be extracted, how they behave, and how they can be integrated into the standard Group Relative Policy Optimization (GRPO) framework. Concretely, we investigate four questions: (1) how to obtain dense token-level rewards from scalar reward models such as HPSv2; (2) what the empirical behavior and distribution of dense rewards over image tokens look like; (3) how to incorporate dense rewards into GRPO via token-weighted advantages while preserving group-wise sample rankings; and (4) how different interpretability methods compare as providers of dense rewards, including trade-offs in localization, computational cost, and downstream performance. On WISE and GenAI-Bench, dense-reward variants of a Janus-Pro-7B U-MLLM achieve competitive image quality (e.g., WISE: 0.50) with slightly smoother training dynamics than a sparse-reward T2I-R1 baseline. As a preliminary case study, we also instantiate a safety-focused variant that incorporates a safety reward and observe a 59.4\% reduction in unsafe content on the MMDT benchmark relative to the base model. Overall, our results suggest that dense rewards are a promising but nuanced design axis for U-MLLM post-training.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 21526