Visual Feedback for Self-Improving Text Layout with MLLM via Reinforcement Learning

17 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Layout, MLLM, RLHF
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a text-only paradigm that generates code to represent layouts, which is then rendered by graphics engines to produce final images. However, such methods are blind to the rendered visual outcome during code generation, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose a self-improving framework for text layout generation that leverages it. Our method enables the model to iteratively generate layout code, render it into an image, visually evaluate the result, and refine the design through reflection until satisfactory quality is reached. We achieve this through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy and aesthetic measures. Importantly, we demonstrate that simple outcome-based rewards are more effective than complex process-oriented reward functions for iterative generation tasks. Experiments across multiple benchmarks show that our approach significantly outperforms code-only baselines, advanced MLLMs, and existing layout models, establishing visual feedback as critical for design-oriented MLLMs.
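To make the generate-render-evaluate-refine loop described in the abstract concrete, the sketch below shows one possible shape of such a pipeline. It is a minimal illustration, not the authors' implementation: the callables `generate`, `render`, `ocr_accuracy`, and `aesthetic`, the 0.7 reward weight, the stopping threshold, and the round budget are all hypothetical placeholders, since the abstract only states that the outcome reward combines OCR accuracy with aesthetic measures.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LayoutAttempt:
    code: str      # layout code produced by the MLLM
    image: object  # rendered image (e.g. a PIL.Image); kept generic here
    reward: float  # outcome-based reward used for RL


def outcome_reward(ocr_acc: float, aesthetic: float, w_ocr: float = 0.7) -> float:
    """Outcome-based reward: a weighted mix of OCR accuracy (readability of the
    rendered text) and an aesthetic score, both assumed to lie in [0, 1].
    The 0.7 weighting is illustrative; the abstract does not specify weights."""
    return w_ocr * ocr_acc + (1.0 - w_ocr) * aesthetic


def self_improve(
    prompt: str,
    generate: Callable[[str, str | None], str],    # MLLM: (prompt, feedback) -> layout code
    render: Callable[[str], object],               # graphics engine: code -> image
    ocr_accuracy: Callable[[object, str], float],  # (image, prompt) -> OCR accuracy in [0, 1]
    aesthetic: Callable[[object], float],          # image -> aesthetic score in [0, 1]
    max_rounds: int = 4,
    target: float = 0.9,
) -> LayoutAttempt:
    """Iteratively generate layout code, render it, score the rendered image,
    and feed a reflection on the visual result back to the model until the
    reward passes a target or the round budget is exhausted."""
    best: LayoutAttempt | None = None
    feedback: str | None = None
    for _ in range(max_rounds):
        code = generate(prompt, feedback)
        image = render(code)
        reward = outcome_reward(ocr_accuracy(image, prompt), aesthetic(image))
        attempt = LayoutAttempt(code=code, image=image, reward=reward)
        if best is None or attempt.reward > best.reward:
            best = attempt
        if reward >= target:
            break
        # Reflection step: summarize the visual shortcomings for the next round.
        feedback = (
            f"Previous render scored {reward:.2f}; "
            "improve text readability and layout balance."
        )
    assert best is not None
    return best
```

In an RL setup along the lines the abstract describes, the scalar returned by `outcome_reward` for the final (or best) attempt would serve as the training signal, rather than rewarding each intermediate reflection step individually.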
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9159