Keywords: GUI Agents, Computer-Use Agents, Multi-Modal Agents
Abstract: Multi-modal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid control, seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing 17,000+ verifiable tasks spanning real-world computer-use scenarios; (3) a multi-agent system generating high-quality hybrid control trajectories with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 27% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows that our model reaches a 21.7% success rate, outperforming baselines trained on Windows data. The hybrid control mechanism proves critical, reducing error propagation while maintaining execution efficiency. This work establishes a scalable paradigm that bridges primitive GUI interactions and programmatic intelligence for stronger and unified computer use.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15862