You Don't Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation

Published: 28 Sept 2025, Last Modified: 09 Oct 2025, SEA @ NeurIPS 2025 Poster, License: CC BY 4.0
Keywords: Software Engineering, Code Generation, GUI Agent, Large Language Models, Benchmark
Abstract: Large Language Models (LLMs) and code agents in software development are rapidly evolving from generating isolated code snippets to producing full-fledged software applications with graphical interfaces, interactive logic, and dynamic behaviors. However, current benchmarks fall short in evaluating such production-ready software, as they often rely on static checks or binary pass/fail scripts, failing to capture the interactive behaviors and runtime dynamics that define real-world usability—qualities that only emerge when an application is actively used. This is the blind spot of current evaluation: *you don't know if an app works until you click through it, interact with it, and observe how it responds.* To bridge this gap, we introduce **RealDevWorld**, a novel evaluation framework for automated end-to-end assessment of LLMs' ability to generate production-ready repositories from scratch. It features two key components: (1) **RealDevBench**, a diverse collection of 194 open-ended software engineering tasks across multiple domains, incorporating multimodal elements to reflect real-world complexity; and (2) **AppEvalPilot**, a new agent-as-a-judge evaluation system that simulates realistic, GUI-based user interactions to automatically and holistically assess software functional correctness, visual fidelity, and runtime behavior. The framework delivers fine-grained, task-specific diagnostic feedback, supporting nuanced evaluation beyond simple success/failure judgments. Empirical results show that RealDevWorld delivers effective, automatic, and human-aligned evaluations, achieving an accuracy of 0.92 and a correlation of 0.85 with expert human assessments, while significantly reducing reliance on manual review. This enables scalable, human-aligned assessment of production-level software generated by LLMs.
Archival Option: The authors of this submission do *not* want it to appear in the archival proceedings.
Submission Number: 43
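To make the agent-as-a-judge idea in the abstract concrete, the minimal sketch below shows what a GUI-interaction evaluation loop of this kind might look like. It is not taken from the submission or from the AppEvalPilot codebase; all names (`TestCase`, `StubGuiDriver`, `judge_case`, `evaluate_app`) are hypothetical, and the stub driver and stub LLM judge stand in for a real GUI automation backend and a real model so the example runs end to end.

```python
# Illustrative sketch only: a minimal "agent-as-a-judge" GUI evaluation loop.
# All names are hypothetical and not taken from RealDevWorld / AppEvalPilot.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TestCase:
    """One functional requirement to verify by interacting with the app."""
    description: str      # e.g. "clicking 'Add to cart' updates the cart badge"
    actions: List[str]    # abstract GUI actions the agent should perform
    expectation: str      # what the judge should look for in the observations


@dataclass
class CaseResult:
    case: TestCase
    passed: bool
    feedback: str         # fine-grained diagnostic feedback, not just pass/fail


class StubGuiDriver:
    """Stand-in for a real GUI automation backend (browser or OS driver)."""

    def perform(self, action: str) -> str:
        # A real driver would click/type/scroll and return a screenshot or DOM
        # dump; here we echo the action so the sketch is runnable end to end.
        return f"observation after '{action}'"


def judge_case(case: TestCase, observations: List[str],
               llm_judge: Callable[[str], str]) -> CaseResult:
    """Ask an LLM judge whether the observed behavior meets the expectation."""
    prompt = (
        f"Requirement: {case.expectation}\n"
        "Observed GUI behavior:\n" + "\n".join(observations) +
        "\nAnswer 'PASS' or 'FAIL' followed by a one-sentence diagnosis."
    )
    verdict = llm_judge(prompt)
    return CaseResult(case=case,
                      passed=verdict.strip().upper().startswith("PASS"),
                      feedback=verdict)


def evaluate_app(cases: List[TestCase], driver: StubGuiDriver,
                 llm_judge: Callable[[str], str]) -> List[CaseResult]:
    """Run every test case through the GUI and collect per-case judgments."""
    results = []
    for case in cases:
        observations = [driver.perform(a) for a in case.actions]
        results.append(judge_case(case, observations, llm_judge))
    return results


if __name__ == "__main__":
    cases = [TestCase(description="cart badge updates",
                      actions=["open home page", "click 'Add to cart'"],
                      expectation="the cart badge increments to 1")]
    # Replace this stub with a call to an actual LLM in a real setup.
    fake_llm = lambda prompt: "PASS - badge incremented as expected"
    for r in evaluate_app(cases, StubGuiDriver(), fake_llm):
        print(r.case.description, "->",
              "PASS" if r.passed else "FAIL", "|", r.feedback)
```

The per-case `CaseResult` records mirror the kind of fine-grained, task-specific feedback the abstract describes, as opposed to a single binary pass/fail verdict for the whole application.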