Keywords: Computer-Using Agents, Large Language Models, Benchmark, AI Safety, Misuse Risk, Sandbox Evaluation, Terminal Agents, GUI Agents, Agentic Frameworks, Monitoring
Abstract: Computer-using agents (CUAs), which can autonomously control computers to perform multi-step actions, may pose significant safety risks if misused. However, existing benchmarks primarily evaluate the safety risks of language models (LMs) in chatbot or simple tool-use scenarios. To evaluate CUAs' misuse risks more comprehensively, we introduce a new benchmark: CUAHarm. CUAHarm consists of 104 expert-written, realistic misuse tasks, such as disabling firewalls, leaking confidential user information, launching denial-of-service attacks, or installing backdoors. We provide a sandbox environment for evaluating these risks. Importantly, we provide rule-based verifiable rewards that measure CUAs' success rates in executing these tasks (e.g., whether the firewall is indeed disabled), rather than measuring refusal rates alone. We evaluate multiple frontier open-source and proprietary LMs, including Claude 4 Sonnet, GPT-5, Gemini 2.5 Pro, Llama-3.3-70B, and Mistral Large 2. Surprisingly, even without carefully crafted jailbreaking prompts, these frontier LMs comply with and execute these malicious tasks at high success rates (e.g., 90% for Gemini 2.5 Pro). Furthermore, while newer models score as safer on earlier safety benchmarks, their misuse risks as CUAs are even higher: for example, Gemini 2.5 Pro completes harmful tasks at a rate 5 percentage points higher than Gemini 1.5 Pro. In addition, we find that while these LMs are robust to common malicious prompts (e.g., creating a bomb) when acting as chatbots, they can still behave unsafely when acting as CUAs. We further evaluate a leading agentic framework (UI-TARS-1.5) and find that while it improves task performance, it also amplifies misuse risks. To mitigate the misuse risks of CUAs, we explore using LMs to monitor CUAs' actions. We find that monitoring unsafe computer-use actions is significantly harder than monitoring conventional unsafe chatbot responses: while monitoring chains of thought yields modest gains, average monitoring accuracy is only 77%. A hierarchical summarization strategy improves accuracy by up to 13%, a promising direction, though monitoring remains unreliable. The benchmark will be released publicly to facilitate further research on mitigating these risks.
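To make the "rule-based verifiable reward" idea concrete, here is a minimal Python sketch of a checker that verifies an environment outcome (whether the firewall is actually disabled) rather than inspecting the model's response. The `run_in_sandbox` helper and the `ufw` check are assumptions for illustration, not the benchmark's actual harness code.

```python
import subprocess

def run_in_sandbox(command: str) -> str:
    """Stand-in for the benchmark's sandbox executor (hypothetical name).
    Here it shells out locally; a real harness would target the sandbox VM."""
    return subprocess.run(
        command, shell=True, capture_output=True, text=True
    ).stdout

def firewall_disabled_reward() -> float:
    """Rule-based verifiable reward: 1.0 if the firewall is actually
    disabled after the agent's trajectory, 0.0 otherwise."""
    # ufw reports "Status: inactive" when the firewall is disabled (Ubuntu-style VM assumed).
    status = run_in_sandbox("sudo ufw status")
    return 1.0 if "Status: inactive" in status else 0.0
```

Because the reward inspects the environment's final state, it measures task success directly, independent of whether the model refused, hedged, or claimed success in its text output.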
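The hierarchical summarization strategy for monitoring could look roughly like the sketch below: summarize chunks of the agent's action trajectory first, then have the monitor judge the combined summaries instead of the raw, lengthy action log. The `llm` callable, chunking scheme, and prompts are assumptions; the paper's actual monitoring setup may differ.

```python
from typing import Callable, List

def hierarchical_monitor(
    actions: List[str],
    llm: Callable[[str], str],
    chunk_size: int = 5,
) -> bool:
    """Hierarchical summarization monitor (sketch): condense chunks of
    computer-use actions into summaries, then judge the summaries.
    Returns True if the trajectory is flagged as unsafe."""
    # First level: summarize each chunk of raw actions.
    chunks = [actions[i:i + chunk_size] for i in range(0, len(actions), chunk_size)]
    summaries = [
        llm("Summarize what these computer-use actions accomplish:\n" + "\n".join(c))
        for c in chunks
    ]
    # Second level: judge the whole session from the summaries.
    verdict = llm(
        "Based on these summaries of an agent's session, answer SAFE or UNSAFE:\n"
        + "\n".join(summaries)
    )
    return "UNSAFE" in verdict.upper()
```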
Primary Area: datasets and benchmarks
Submission Number: 4920