SenseAct: Structuring GUI Actions for Reliable Planning and Verification

Published: 01 Mar 2026, Last Modified: 24 Apr 2026
Venue: ICLR 2026 AIWILD
License: CC BY 4.0
Keywords: GUI Agents, Structured Actions, Long-Horizon Tasks
TL;DR: SenseAct is a GUI agent that represents the UI in a structured, symbolic form and achieves robust performance without relying on vision-language models during normal execution.
Abstract: Reliable interaction with graphical user interfaces (GUIs) requires agents to make irreversible control commitments under partial observability, yet most existing GUI agents reduce this problem to step-by-step, perception-conditioned action prediction, leaving decisions implicit and execution unverifiable. We introduce SenseAct, a new paradigm for GUI agents that lifts low-level GUI observations (e.g., XML/DOM trees) into explicit control commitments specifying both the controllable interface elements and their expected state transitions. In SenseAct, decision making is formulated over typed control primitives whose dependencies and ordering are explicitly represented, and whose execution is governed by programmatic post-condition predicates that define precise, observable success criteria. This formulation grounds task progress in symbolic state transitions, turning execution verification into a well-defined, deterministic decision rather than a language-based self-judgment. As a consequence of these explicit execution semantics, SenseAct operates effectively without pixel-level visual reasoning during normal execution and invokes a vision-language model (VLM) only when symbolic state transitions violate the expected control effects. On the challenging DroidTask and AndroidLab benchmarks, SenseAct reduces VLM calls by 14.49% and 30.55%, respectively, while achieving higher success rates, demonstrating that internalizing programmatic verification within a closed-loop control process effectively eliminates execution drift and enhances the efficiency of GUI representations.
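To make the idea of typed control primitives with programmatic post-conditions concrete, here is a minimal Python sketch. It is not the authors' implementation: the names `UIState`, `ControlPrimitive`, `apply`, and `run_plan`, the two primitive kinds, and the toy executor are all assumptions introduced for illustration. It only shows how a deterministic predicate over a symbolic state could decide success, with a fallback hook (standing in for a VLM call) triggered only on a violated post-condition.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical symbolic UI state: element id -> attributes, as might be
# extracted from an XML/DOM tree. Not the paper's actual representation.
UIState = Dict[str, Dict[str, object]]


@dataclass
class ControlPrimitive:
    """A typed action on one element plus a programmatic post-condition."""
    name: str
    kind: str                                  # "tap" or "set_text" (illustrative)
    target: str                                # id of the controllable element
    postcondition: Callable[[UIState], bool]   # observable success criterion
    depends_on: List[str] = field(default_factory=list)
    value: Optional[str] = None


def apply(state: UIState, p: ControlPrimitive) -> UIState:
    """Toy executor standing in for a real device; returns a new state."""
    nxt = {k: dict(v) for k, v in state.items()}
    elem = nxt.get(p.target)
    if elem is None or not elem.get("enabled", True):
        return nxt  # no effect on missing or disabled elements
    if p.kind == "tap":
        elem["checked"] = not elem.get("checked", False)
    elif p.kind == "set_text":
        elem["text"] = p.value
    return nxt


def run_plan(
    state: UIState,
    plan: List[ControlPrimitive],
    fallback: Optional[Callable[[ControlPrimitive, UIState], None]] = None,
) -> Tuple[UIState, List[Tuple[str, str]]]:
    """Execute primitives in order, verifying each one deterministically."""
    verified: List[str] = []
    log: List[Tuple[str, str]] = []
    for p in plan:
        # Skip a primitive whose dependencies have not been verified.
        if any(d not in verified for d in p.depends_on):
            log.append((p.name, "blocked"))
            continue
        state = apply(state, p)
        if p.postcondition(state):
            verified.append(p.name)
            log.append((p.name, "verified"))
        else:
            log.append((p.name, "violated"))
            if fallback is not None:
                fallback(p, state)  # assumed hook, e.g. a VLM-based recovery
    return state, log
```

In this sketch, verification never consults a language model: each step succeeds or fails according to its predicate over the resulting symbolic state, and the fallback is reached only when the observed transition differs from the expected one.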
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 105