Knowing When to STOP, RECOVER, and SEARCH: A Modular Framework for GUI Automation
Keywords: Computer-use Agent, Multi-modal Agent, Agentic Systems
Abstract: Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery.
We present GUI-Pro-Agent, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search.
First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step: an agent-level verifier cross-examines each completion claim against decision rules and rejects any claim that lacks direct visual evidence.
Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing strategy changes after persistent screen-state recurrence, and binding reflection signals to strategy shifts.
Third, an on-demand Search Agent looks up unfamiliar workflows online by directly querying a search-capable LLM and returns the results as plain text.
We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand when required.
We evaluate GUI-Pro-Agent across five top-tier backbones, including Opus 4.5, 4.6 and Gemini 3.1 Pro, on two benchmarks covering Linux and Windows tasks, and it achieves top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass.
In particular, GUI-Pro-Agent with Sonnet 4.6 at only 15 action steps already surpasses the best published 50-step system.
Ablation studies show that all three proposed components consistently improve a strong backbone (e.g., Sonnet 4.6), while a weaker backbone (e.g., Gemini 3 Flash) benefits more from these tools when the step budget is sufficient.
Further analysis shows that the Loop Breaker nearly halves wasted steps for loop-prone models.
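The stop and recover logic described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `LoopBreaker`, the function `verify_completion`, the thresholds, the returned decision strings, and the use of a screen hash to detect recurring states are all assumptions made for illustration, since the abstract gives no concrete values or interfaces.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LoopBreaker:
    """Hypothetical multi-tier loop filter; thresholds are illustrative only."""
    max_failures: int = 2      # assumed: consecutive failures before a mode switch
    max_recurrences: int = 3   # assumed: visits to one screen state before a strategy change
    failures: int = 0
    seen_states: Counter = field(default_factory=Counter)

    def advise(self, screen_hash: str, action_failed: bool,
               reflection_flags_loop: bool = False) -> str:
        """Return 'continue', 'switch_mode', or 'change_strategy'."""
        self.seen_states[screen_hash] += 1
        self.failures = self.failures + 1 if action_failed else 0

        # A reflection signal is bound directly to a strategy shift.
        if reflection_flags_loop:
            return self._reset("change_strategy")
        # Persistent recurrence of the same screen state forces a strategy change.
        if self.seen_states[screen_hash] >= self.max_recurrences:
            return self._reset("change_strategy")
        # Repeated failures switch the interaction mode (e.g. mouse to keyboard).
        if self.failures >= self.max_failures:
            self.failures = 0
            return "switch_mode"
        return "continue"

    def _reset(self, decision: str) -> str:
        # Clear counters so the new strategy starts from a clean history.
        self.failures = 0
        self.seen_states.clear()
        return decision


def verify_completion(claimed_done: bool, criteria: dict[str, bool]) -> bool:
    """Accept a finish claim only if every success criterion has visual evidence.

    `criteria` maps each UI-observable criterion to whether it was observed;
    an empty criteria set is rejected because nothing was verified.
    """
    return claimed_done and bool(criteria) and all(criteria.values())
```

In this sketch an agent loop would call `advise` after every action and `verify_completion` whenever the backbone proposes to finish, continuing the episode if the claim is rejected. Ordering the checks from reflection signal down to failure count is one possible design choice, not something the abstract specifies.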
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 87