Keywords: Artifacts, ArtifactsBench, MLLM, Code
Abstract: The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and largely overlook the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a benchmark and automated, multimodal evaluation paradigm for visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior via temporal (three-step) screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We curate 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves 94.4% ranking consistency with WebDev Arena—a de facto gold standard for human preferences in web development—and up to 90.95% pairwise agreement with human experts. We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at https://anonymous.4open.science/r/ArtifactsBench-F7F9, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.
Primary Area: datasets and benchmarks
Submission Number: 4022