Keywords: Code Generation; Multimodal Application
Abstract: In recent years, Multimodal Large Language Models (MLLMs) have made substantial progress in visual understanding and language generation, opening new opportunities for automating front-end web development. The WebUI-to-Code task, which translates webpage design mockups or screenshots directly into structured HTML, has emerged as a promising paradigm for intelligent front-end engineering. However, existing MLLMs often exhibit significant limitations on real-world webpages with complex layouts and diverse visual styles, including code compilation failures and severe layout misalignments. A key reason for these issues is the lack of a structured, human-like cognitive process, namely the “perceive first, then generate” paradigm that human developers commonly follow. To address this gap, we propose a reinforcement learning framework that explicitly strengthens the model’s reasoning ability prior to code generation. Specifically, we introduce a structured layout reasoning stage and design a three-stage reward mechanism that supervises (i) the quality of the layout reasoning, (ii) the accuracy of the generated code, and (iii) the consistency between the reasoning and the code. This reward formulation is designed so that the reasoning process provides strong positive feedback to the code generation outcome. To rigorously evaluate our approach, we construct and manually curate a benchmark of 1,800 real-world webpages spanning multiple levels of layout complexity and visual detail. Experimental results demonstrate that our reasoning-enhanced method significantly improves the performance of the baseline model and matches or even surpasses much larger MLLM baselines in compilation success rate, layout fidelity, and styling accuracy.
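As a rough illustration of the three-stage reward mechanism, the sketch below combines the three supervised signals into a single scalar. The scoring proxies, function names, and weights are illustrative assumptions, not the paper's implementation; in particular, the well-formedness test stands in for whatever compilation and rendering checks the actual reward uses.

```python
# Minimal sketch of the three-stage reward described in the abstract.
# All proxies and weights here are assumptions for illustration only.
import xml.etree.ElementTree as ET


def _token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word tokens, a crude stand-in for a learned similarity score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


def layout_reasoning_reward(reasoning: str, reference_layout: str) -> float:
    # Stage (i): quality of the layout reasoning, approximated here by
    # token overlap with a reference layout description.
    return _token_overlap(reasoning, reference_layout)


def code_accuracy_reward(html_code: str) -> float:
    # Stage (ii): accuracy of the generated code. As a stand-in for a
    # compilation check, test XML well-formedness (real HTML parsing is
    # more forgiving); a full reward would also compare the rendered page
    # against the input screenshot.
    try:
        ET.fromstring(html_code)
        return 1.0
    except ET.ParseError:
        return 0.0


def consistency_reward(reasoning: str, html_code: str) -> float:
    # Stage (iii): consistency between the stated layout plan and the
    # structure actually emitted in the code.
    return _token_overlap(reasoning, html_code)


def total_reward(reasoning: str, html_code: str, reference_layout: str,
                 weights: tuple = (0.3, 0.5, 0.2)) -> float:
    # Weighted combination of the three stages; the weights are assumptions.
    w1, w2, w3 = weights
    return (w1 * layout_reasoning_reward(reasoning, reference_layout)
            + w2 * code_accuracy_reward(html_code)
            + w3 * consistency_reward(reasoning, html_code))
```

In a policy-optimization loop, a function like this `total_reward` would score each sampled (reasoning, code) pair; the specific RL algorithm is not indicated by the abstract.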
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 25107