Figma2Code: Automating Multimodal Design to Code in the Wild

ICLR 2026 Conference Submission 920 Authors

02 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Code Generation, Design to Code
Abstract: Front-end development constitutes a substantial portion of software engineering, yet converting design mockups into production-ready *User Interface* (UI) code remains tedious and time-consuming. While recent work has explored automating this process with *Multimodal Large Language Models* (MLLMs), existing approaches typically rely solely on design images. As a result, they must infer complex UI details from images alone, often leading to degraded results. In real-world development workflows, however, design mockups are usually delivered as Figma files (Figma being a widely used front-end design tool), which embed rich multimodal information (e.g., metadata and assets) essential for generating high-quality UI. To bridge this gap, we introduce Figma2Code, a new task that generalizes *design-to-code* into a multimodal setting and aims to automate *design-to-code* in the wild. Specifically, we collect paired design images and their corresponding metadata files from the Figma community. We then apply a series of processing operations, including rule-based filtering, human and MLLM-based annotation and screening, and metadata refinement. This process yields 3,055 samples, from which designers curate a balanced dataset of 213 high-quality cases. Using this dataset, we benchmark ten state-of-the-art open-source and proprietary MLLMs. Our results show that while proprietary models achieve superior visual fidelity, they remain limited in layout responsiveness and code maintainability. Further experiments across modalities and ablation studies corroborate this limitation, which we attribute in part to models' tendency to directly map primitive visual attributes from Figma metadata.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 920