TL;DR: We introduce R3-Bench and a self-evolving RL framework that enables multimodal models to autonomously diagnose and correct visual generation errors for higher-quality images.
Abstract: Text-to-Image (T2I) models and Unified Multimodal Models (UMMs) have achieved remarkable progress in visual generation.
However, their reliance on a single-pass generation paradigm limits their ability to handle complex prompts requiring iterative refinement. To enable multi-round Reflective Visual Generation (RVG), we formalize the *Reason--Reflect--Rectify* (R$^3$) loop as a core framework and introduce R$^3$-Bench, a benchmark of 670 expert-annotated instances that quantifies iterative reasoning and rectification capabilities.
Evaluation on R$^3$-Bench reveals a critical gap: while state-of-the-art models can identify generation errors, they fail to generate actionable rectification instructions.
To bridge this gap, we propose R$^3$-Refiner, a dual-stage framework leveraging Group Relative Policy Optimization (GRPO) and a Hierarchical Reward Mechanism (HRM) to better align rectification with reflective reasoning.
Experiments show that R$^3$-Refiner achieves significant improvements on R$^3$-Bench (+12.0% in Reflective Verdict Score, +9.0% in Rectification Score),
and can be seamlessly integrated with various MLLMs to enhance the generation quality of different T2I models on GenEval++ and T2I-CompBench.
Code is available at https://github.com/xiaomoguhz/R3-Bench.
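For readers unfamiliar with GRPO, the group-relative advantage it relies on can be sketched as follows. This is a minimal illustration of the general technique, not the R$^3$-Refiner implementation: the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made here, and the Hierarchical Reward Mechanism that would produce the scalar rewards is not modeled.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one group of sampled responses.

    GRPO samples several responses for the same input and scores each one
    relative to the group: advantage = (reward - group mean) / group std.
    No learned value function is needed. `eps` (an assumption) avoids
    division by zero when every response receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rectification instructions sampled for one image.
example_rewards = [0.2, 0.9, 0.5, 0.4]
example_advantages = group_relative_advantages(example_rewards)
```

Responses scoring above the group mean receive a positive advantage and are reinforced; those below the mean are penalized.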
Lay Summary: Although many text-to-image systems can generate high-quality images, they still struggle with complex prompts involving multiple attributes, such as object counts, colors, or spatial arrangements. Even when a system detects visual discrepancies in a suboptimal image, it often fails to provide actionable instructions for correction.
This paper investigates the gap between visual error detection and subsequent rectification. We introduce R3-Bench, a benchmark comprising 670 expertly curated instances that evaluates the capability of models to verify image-text alignment, explain discrepancies, and propose revision strategies. Furthermore, we propose R3-Refiner, a training methodology designed to translate visual reasoning capabilities into explicit and actionable rectification instructions.
Extensive experiments demonstrate that our approach significantly improves both error diagnosis and image correction while generalizing across various image generation frameworks. These advances make text-to-image tools more reliable for diverse applications, though deployment must be accompanied by robust safeguards against misleading synthetic content.
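The verify, explain, and revise cycle described in this summary can be sketched as a simple control loop. This is an illustrative sketch only: `Reflection`, `r3_loop`, the `generate` and `reflect` callables, `max_rounds`, and the way the revision instruction is appended to the prompt are all hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Reflection:
    aligned: bool      # verdict: does the image match the prompt?
    explanation: str   # diagnosis of any discrepancy
    instruction: str   # actionable revision instruction


def r3_loop(
    prompt: str,
    generate: Callable[[str], str],             # hypothetical T2I call, returns an image handle
    reflect: Callable[[str, str], Reflection],  # hypothetical MLLM judge of (prompt, image)
    max_rounds: int = 3,
) -> Tuple[str, List[Tuple[str, Reflection]]]:
    """Generate, then repeatedly reflect and rectify until aligned or out of rounds."""
    history: List[Tuple[str, Reflection]] = []
    image = generate(prompt)
    for _ in range(max_rounds):
        reflection = reflect(prompt, image)
        history.append((image, reflection))
        if reflection.aligned:
            break
        # Assumption: rectification is applied by appending the instruction to the prompt.
        image = generate(f"{prompt}\nRevision: {reflection.instruction}")
    return image, history
```

If the round budget runs out without an aligned verdict, the loop returns the last regenerated image, which has not itself been reflected on.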
Link To Code: https://github.com/xiaomoguhz/R3-Bench
Primary Area: Applications->Computer Vision
Keywords: Multimodal, Visual Generation, Large Language Models
Submission Number: 6997