Keywords: Artifacts, ArtifactsBench, MLLM, Code
Abstract: The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and largely overlook the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a benchmark and automated, multimodal evaluation paradigm for visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior via temporal (three-step) screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We curate 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves 94.4% ranking consistency with WebDev Arena—a de facto gold standard for human preferences in web development—and up to 90.95% pairwise agreement with human experts. We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at https://anonymous.4open.science/r/ArtifactsBench-F7F9, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.
Primary Area: datasets and benchmarks
Submission Number: 4022