FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback
Keywords: front-end, code generation, multi-turn code generation, multi-turn conversation, visual coding
TL;DR: We propose FronTalk, a benchmark for front-end development, to explore a unique setting of conversational code generation with multi-modal feedback.
Abstract: We present **FronTalk**, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: **conversational code generation with multi-modal feedback**. In front-end development, visual artifacts such as sketches, mockups, and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn pairs a textual instruction with an equivalent visual instruction that expresses the same user intent. To comprehensively evaluate model performance, we propose a novel *agent-based evaluation framework* that uses a web agent to simulate users and explore the website, thereby measuring both implementation correctness and user experience. Evaluation of 14 models reveals two key challenges that are underexplored in the literature: (1) a significant *forgetting issue*, where models overwrite previously implemented features and cause task failures, and (2) a persistent difficulty in *interpreting visual feedback*, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with ACECoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach reduces forgetting to **nearly zero** and improves performance by up to **9.3\%** absolute (56.0\%$\rightarrow$65.3\%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 9999