HackWorld: Evaluating Computer-Use Agents on Exploiting Web Application Vulnerabilities

ICLR 2026 Conference Submission 14298 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: computer use agents, llms, evaluation
Abstract: Web applications are prime targets for cyberattacks due to their role as entry points to vital services and sensitive data repositories. Traditional penetration testing is expensive and requires specialized expertise, creating scalability challenges for securing the expanding web ecosystem. While language model agents have shown promise in certain cybersecurity tasks, modern web applications require visual understanding of complex user interfaces, dynamic content rendering, and multi-step interactive workflows that only computer-use agents (CUAs) can handle. Despite CUAs' demonstrated capabilities in web browsing and visual task automation, their potential to discover and exploit web application vulnerabilities through graphical interfaces remains unknown. Understanding these exploitation capabilities is critical as these agents increasingly operate autonomously in vulnerable environments. We introduce HackWorld, the first evaluation framework for systematically assessing computer-use agents' capabilities in exploiting web application vulnerabilities through visual interaction. Unlike existing benchmarks using sanitized environments, HackWorld exposes CUAs to 36 curated applications spanning 11 frameworks and 7 languages, containing realistic vulnerabilities including injection flaws, authentication bypasses, and unsafe input handling. Our framework directly evaluates CUAs' ability to discover and exploit these vulnerabilities using Capture-the-Flag (CTF) methodology while navigating complex web interfaces. Evaluation of state-of-the-art CUAs reveals concerning patterns: CUAs achieve exploitation rates below 12% and frequently show poor cybersecurity awareness during attempts. They often struggle to plan multi-step attacks and use security tools ineffectively. These findings highlight the current limitations of CUAs performing security tasks inside web environments.
Our results expose CUAs' limited cybersecurity capabilities when operating on vulnerable web applications, opening future research directions on developing security-aware CUAs for vulnerability detection and enhancing their exploitation skills in cybersecurity.
Primary Area: datasets and benchmarks
Submission Number: 14298