Your Cursor is Not Secure: Command Line Interface Agent Can Expose Realistic Risks Through Tactics, Techniques, and Procedures

Published: 23 May 2026, Last Modified: 23 May 2026
Venue: ICML 2026 AIWILD
License: CC BY 4.0
Keywords: Agent Safety Benchmark, Agent Safety, Red-teaming, CLI Agent
Abstract: Command Line Interface (CLI) agents powered by large language models (LLMs) are rapidly maturing as assistants that operate within a computer's CLI, capable of understanding natural language requests, planning tasks, executing commands, and modifying files and code. Among their most critical applications is operating system (OS) control. As CLI agents in the OS domain become increasingly embedded in daily operations, it is imperative to examine their real-world security implications, specifically whether they can be misused to perform realistic, security-relevant attacks. Existing works exhibit four major limitations: a missing attacker-knowledge model of tactics, techniques, and procedures (TTPs); incomplete coverage of end-to-end kill chains; unrealistic environments that lack multiple hosts and encrypted user credentials; and unreliable judgment that depends on LLM-as-a-Judge. To address these gaps, we propose AdvCLI, the first benchmark aligned with real-world TTPs in the MITRE ATT&CK Enterprise Matrix. AdvCLI comprises 140 tasks (40 direct malicious tasks, 74 TTP-based malicious tasks, and 26 end-to-end kill chains) and systematically evaluates CLI agents against realistic OS security threats in a multi-host sandbox environment using hard-coded evaluation. We evaluate six mainstream CLI agents (ReAct, AutoGPT, Gemini CLI, Claude Code, Cursor CLI, and Cursor IDE) based on 9 foundation LLMs. The results highlight critical vulnerabilities in current frontier CLI agents, underscoring the urgent need for future research to address alignment vulnerabilities in CLI agents regarding OS security-centric threats.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 225