FigmaBench: Evaluating Design-to-Code Generation in Real-World Handoff Scenarios

ACL ARR 2026 January Submission7502 Authors

06 Jan 2026 (modified: 20 Mar 2026) | ACL ARR 2026 January Submission | Readers: Everyone | License: CC BY 4.0

Keywords: Multimodal Large Language Model; Design-to-Code; Benchmark

Abstract: Automating design-to-code translation remains a longstanding goal in software engineering. While Multimodal Large Language Models (MLLMs) have shown promise in design-to-code tasks, existing benchmarks rely solely on screenshots, discarding the structured metadata indispensable in real-world design handoff workflows. Current evaluation paradigms are also deficient: code-level metrics fail to account for implementation variability, visual metrics overlook fine-grained defects, and standardized assessments of responsive behavior remain absent. To address these limitations, we introduce FigmaBench, an industrial-grade benchmark for multimodal design-to-code generation. We collect raw design files from the Figma Community and apply a multi-stage pipeline to curate 1,234 high-quality samples, each comprising high-resolution screenshots and complete JSON metadata. We further propose a comprehensive evaluation framework with four complementary metrics: visual consistency (VCS), structural layout alignment (SLA), textual and stylistic fidelity (TSF), and responsive quality score (RQS). Through extensive evaluation of state-of-the-art MLLMs, we uncover a critical "fidelity-responsiveness paradox": models that achieve high visual fidelity tend to generate rigid, non-responsive code. We trace this phenomenon to a "metadata trap", in which models shortcut layout reasoning by directly transcribing absolute coordinates from the metadata rather than generating fluid structures. Our data and code are available at https://anonymous.4open.science/r/FigmaBench-6C84/.

Paper Type: Long

Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond

Research Area Keywords: multimodality; cross-modal content generation; cross-modal application

Languages Studied: HTML, CSS, English

Submission Number: 7502
