Plan-and-Paint: Collaborating Semantic and Noise Reasoning for Text-to-Image Generation

18 Sept 2025 (modified: 02 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Text-to-Image Generation; Semantic Reasoning; Noise Reasoning
Abstract: Despite the transformative success of chain-of-thought (CoT) reasoning and reinforcement learning (RL) in large language models, their application to visual generation, where reasoning is a critical challenge, remains largely unexplored. In this paper, we present Plan-and-Paint, a novel framework that integrates a dual-level reasoning hierarchy for text-to-image generation. Our framework operates at two critical stages: (1) at the semantic level, an adaptive planner first decomposes the input prompt into a structured generation plan, and (2) at the foundational level, a reinforcement learning agent optimizes the initial noise prior so that it aligns with this plan. To coordinate these two stages, we introduce a unified reinforcement learning paradigm based on GRPO that jointly optimizes planning coherence and execution fidelity through a composite reward function. Extensive experiments demonstrate the superiority of our approach: Plan-and-Paint achieves significant improvements on both the GenEval (0.87→0.90) and WISE benchmarks. Most importantly, our method secures the top rank on the GenEval benchmark, outperforming a wide range of top-tier open-source and closed-source competitors, including GPT-Image-1 High, Janus-Pro-7B, Qwen-Image, BAGEL, and Seedream 3.0, by a significant margin. Our work advances the state of the art in text-to-image generation, demonstrating that an explicit reasoning hierarchy is key to unlocking controllable and compositional generation. To facilitate future research, we will make our code and pre-trained models publicly available.
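For readers unfamiliar with GRPO (Group Relative Policy Optimization), the following minimal PyTorch sketch illustrates the group-relative advantage at its core and how a composite reward over planning and execution terms could plug into it. All names, the 0.5/0.5 reward weighting, and the reduction of GRPO to a plain policy-gradient loss are illustrative assumptions, not the paper's implementation; the full algorithm additionally uses a clipped surrogate objective and a KL regularizer, both omitted here for brevity.

import torch

def grpo_loss(log_probs, rewards, eps=1e-6):
    # Group-relative advantage: normalize each rollout's reward against
    # the mean/std of its own sampling group (the defining idea of GRPO).
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    # REINFORCE-style loss; in this setting log_probs would sum the
    # semantic planner's and the noise agent's log-probabilities.
    return -(advantages.detach() * log_probs).mean()

# Toy usage: a group of 8 rollouts. r_plan and r_exec stand in for
# hypothetical planning-coherence and execution-fidelity reward terms.
log_probs = torch.randn(8, requires_grad=True)
r_plan, r_exec = torch.rand(8), torch.rand(8)
rewards = 0.5 * r_plan + 0.5 * r_exec  # illustrative weighting only
loss = grpo_loss(log_probs, rewards)
loss.backward()

Because the advantage is computed relative to the sampling group, no learned value function is needed, which is one reason GRPO is attractive for jointly training heterogeneous components such as a planner and a noise-optimization agent.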
Supplementary Material: zip
Primary Area: generative models
Submission Number: 11194