Keywords: GUI Grounding, GUI Agent, Computer-Using Agent
Abstract: Graphical User Interface (GUI) grounding is a fundamental task for GUI agents, commonly framed as a coordinate prediction task that identifies an on-screen pixel for actions such as clicks and keystrokes. Although recent Vision Language Models (VLMs) show strong capabilities in understanding GUIs, they often fail at grounding when processing high-resolution GUIs with complex layouts. To address this issue, we reframe GUI grounding as an interactive search task, in which the VLM agent outputs actions that move a cursor across the GUI to locate UI elements. At each step, the model determines the target object, evaluates the spatial relation between the cursor and the target, and moves the cursor closer to the target conditioned on the movement history. We train our GUI grounding agent, GUI-Cursor 7B, using multi-step online reinforcement learning with a dense trajectory-based reward function. Our experimental results show that GUI-Cursor 7B achieves state-of-the-art accuracy on ScreenSpot-v2 (93.9\%) and ScreenSpot-Pro (56.5\%). Moreover, the number of movement steps decreases as grounding accuracy improves during training; the final model solves the problem within two turns for 95\% of instances and adaptively takes more steps on more difficult examples.
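The interactive search loop described in the abstract can be sketched as follows. This is a minimal illustrative simulation, not the paper's implementation: the greedy move policy (a stand-in for the VLM's action output), the function names, and the distance-reduction reward are all assumptions made here to show how a cursor trajectory with a dense reward might be structured.

```python
import math


def move_cursor(cursor, target, step=100.0):
    # Hypothetical policy stand-in: move up to `step` pixels toward the target.
    # In GUI-Cursor, this decision is made by the VLM from the screenshot and history.
    dx, dy = target[0] - cursor[0], target[1] - cursor[1]
    dist = math.hypot(dx, dy)
    if dist <= step:
        return target
    return (cursor[0] + dx / dist * step, cursor[1] + dy / dist * step)


def dense_reward(prev, new, target):
    # Illustrative dense trajectory reward: positive when a move reduces
    # the cursor-to-target distance (an assumption, not the paper's exact reward).
    before = math.hypot(target[0] - prev[0], target[1] - prev[1])
    after = math.hypot(target[0] - new[0], target[1] - new[1])
    return before - after


def ground(target, start=(0.0, 0.0), tol=1.0, max_steps=20):
    # Iteratively move the cursor until it lands within `tol` pixels of the target,
    # recording (previous position, new position, reward) for each step.
    cursor, trajectory = start, []
    for _ in range(max_steps):
        if math.hypot(target[0] - cursor[0], target[1] - cursor[1]) <= tol:
            break
        new = move_cursor(cursor, target)
        trajectory.append((cursor, new, dense_reward(cursor, new, target)))
        cursor = new
    return cursor, trajectory
```

Under this toy policy, a target 250 pixels away is reached in three moves, and every step earns a positive reward, mirroring the paper's observation that trajectories shorten as grounding improves.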
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 13804