Computer-Use Agent Frameworks Can Expose Realistic Risks Through Tactics, Techniques, and Procedures

Weidi Luo; Qiming Zhang; Tianyu Lu; Xiaogeng Liu; CHIU Hung Chun; Siyuan Ma; Bin Hu; Yizhe Zhang; Xusheng Xiao; Yinzhi Cao; Zhen Xiang; Chaowei Xiao

01 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Computer-Use Agent, Attack

Abstract: Computer-use agent (CUA) frameworks, powered by large language models (LLMs) or multimodal LLMs (MLLMs), are rapidly maturing as assistants that can perceive context, reason, and act directly within software environments. Among their most critical applications is operating system (OS) control. As CUAs in the OS domain become increasingly embedded in daily operations, it is imperative to examine their real-world security implications, specifically whether CUAs can be misused to perform realistic, security-relevant attacks. Existing works exhibit four major limitations: a missing attacker-knowledge model of tactics, techniques, and procedures (TTPs); incomplete coverage of end-to-end kill chains; unrealistic environments without multiple hosts or encrypted user credentials; and unreliable judgment that depends on LLM-as-a-Judge. To address these gaps, we propose AdvCUA, the first benchmark aligned with real-world TTPs in the MITRE ATT&CK Enterprise Matrix. It comprises 140 tasks (40 direct malicious tasks, 74 TTP-based malicious tasks, and 26 end-to-end kill chains) and systematically evaluates CUAs under a realistic enterprise OS security threat in a multi-host sandbox environment using hard-coded evaluation. We evaluate five mainstream CUAs, namely ReAct, AutoGPT, Gemini CLI, Cursor CLI, and Cursor IDE, based on 8 foundation LLMs. On TTP tasks, Cursor CLI achieves the highest average attack success rate (ASR) at 69.59%, notably surpassing the ReAct-based CUA at 52.29% and Cursor IDE at 51.66%. On end-to-end kill chain tasks, Cursor IDE attains the highest average ASR at 34.62%, followed by Cursor CLI at 26.93% and the ReAct-based CUA at 23.37%, across all evaluated LLMs. The results demonstrate that current frontier CUAs do not adequately cover OS security-centric threats. These capabilities of CUAs reduce dependence on custom malware and deep domain expertise, enabling even inexperienced attackers to mount complex enterprise intrusions, which raises societal concern about the responsibility and security of CUAs.
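
The average ASR figures quoted in the abstract can be read with the following minimal sketch. It assumes, since the abstract does not say so, that ASR is the percentage of tasks whose hard-coded check passes and that the average is an unweighted mean over the foundation LLMs. The function names and the toy data are hypothetical and are not taken from AdvCUA.

```python
from statistics import mean


def attack_success_rate(outcomes):
    """Percentage of tasks whose hard-coded check passed (assumed definition)."""
    if not outcomes:
        raise ValueError("no task outcomes")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(outcomes) / len(outcomes)


def average_asr(per_model_outcomes):
    """Unweighted mean of per-model ASR (assumed averaging scheme)."""
    return mean(attack_success_rate(o) for o in per_model_outcomes.values())


# Hypothetical toy outcomes for two models on four tasks each.
toy = {
    "model_a": [True, False, True, True],
    "model_b": [False, False, True, True],
}

if __name__ == "__main__":
    print(f"average ASR: {average_asr(toy):.2f}%")
```

Under this reading, a framework's reported average is sensitive to the choice of LLMs in the pool, so per-model ASR should be inspected alongside it.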

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 656
