Keywords: Image generation, Reinforcement Learning, LLM multi-agent system
Abstract: Recent advances in image generation have achieved impressive results in producing high-quality images. However, existing image generation models still struggle with a spatial reasoning dilemma: they often fail to accurately capture fine-grained spatial relationships from the prompt and to generate scenes with structural integrity. To mitigate this dilemma, we propose **RL-RIG**, a **R**einforcement **L**earning framework for **R**eflection-based **I**mage **G**eneration. Our architecture comprises four primary components that follow a **Generate-Reflect-Edit** paradigm, eliciting reasoning in image generation to address the dilemma. The process repeats three steps: generating a new image conditioned on the input prompt and any previously generated image (if available), verifying whether the image satisfies all specified spatial relationships, and producing edit prompts when it does not. Training proceeds in two distinct stages: first, we employ Group Relative Policy Optimization (GRPO) to train the VLM Actor to produce edit prompts; second, we train the Image Editor with GRPO to improve image quality under a given edit prompt. Unlike traditional approaches that produce visually stunning yet structurally unreasonable content, our evaluation prioritizes spatial accuracy, using Scene Graph IoU and a VLM-as-a-Judge strategy to assess the spatial consistency of generated images on the LAION-SG dataset. Experimental results show that RL-RIG outperforms existing state-of-the-art open-source models by up to 11%, generating images with controllable and precise spatial relationships. \footnote{Code is available at \url{https://anonymous.4open.science/r/RL-RIG-demo-12AE/}}
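The Generate-Reflect-Edit loop and the Scene Graph IoU check described above can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: `run_rig`, `sg_iou`, and the simulated one-relation-per-round "generator" are hypothetical stand-ins, and relations are modeled as (subject, relation, object) triples with Scene Graph IoU read as a set IoU over triples.

```python
def sg_iou(pred, gt):
    """Scene Graph IoU: overlap of predicted vs. ground-truth relation
    triples (a set-IoU reading of the metric; details assumed)."""
    pred, gt = set(pred), set(gt)
    return len(pred & gt) / len(pred | gt) if pred | gt else 1.0

def run_rig(required, max_rounds=5):
    """Toy Generate-Reflect-Edit loop. The 'image' is a set of realized
    relation triples; each round the simulated generator fixes one
    missing relation, mimicking iterative refinement."""
    image = set()
    for round_id in range(1, max_rounds + 1):
        # Generate: condition on the prompt and the previous image.
        missing = [r for r in required if r not in image]
        if missing:
            image = image | {missing[0]}  # simulate one fix per round
        # Reflect: verify all specified spatial relationships hold.
        if sg_iou(image, required) == 1.0:
            return image, round_id
        # Edit: emit an edit prompt naming what is still wrong.
        unmet = [r for r in required if r not in image]
        edit_prompt = "fix: " + "; ".join(" ".join(r) for r in unmet)
    return image, max_rounds
```

In the real system the generator, verifier, and edit-prompt writer would be the image model, a VLM judge, and the GRPO-trained VLM Actor, respectively; the loop structure is what this sketch illustrates.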
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9285