Label-free GUI Grounding via Confidence-guided Negative Reinforcement Learning

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: GUI Agent, GUI Grounding
Abstract: Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current supervised fine-tuning and reinforcement learning approaches rely heavily on costly annotated data, creating a bottleneck for scaling GUI agents. We introduce a label-free training paradigm leveraging two key insights: (1) coordinate tokens in model outputs exhibit distinct confidence patterns that reliably identify correct predictions, and (2) in sparse GUI coordinate spaces, negative samples provide more reliable learning signals than potentially corrupted positive ones. We propose Confidence-Guided Reinforcement Learning (CRL), which uses coordinate-token confidence to select pseudo-labels from multiple samples and assigns distance-based rewards. We further develop Confidence-Guided Negative Reinforcement Learning (CNRL), which exclusively learns from negative samples. Without using any annotations, CNRL-7B achieves 92.1\% on ScreenSpot-V2, surpassing UI-TARS-72B (90.3\%) trained on 18.4M labels. On ScreenSpot-Pro, CNRL-7B reaches 33.8\%, improving 8.9\% absolute over the base model and exceeding GUI-R1-7B (31.0\%) trained on 3K labels. On challenging high-resolution benchmarks, CNRL consistently outperforms CRL by 1-1.5\%, demonstrating that learning what to avoid can be more effective than learning from uncertain positive examples. Our findings establish coordinate-token confidence as a powerful alternative to manual annotations for scalable GUI agent development.
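The abstract's core mechanism can be sketched in a few lines: score each sampled prediction by the confidence of its coordinate tokens, promote the most confident sample to a pseudo-label, and derive rewards from the distance to that pseudo-label. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation; the confidence measure (mean token probability), the reward shapes, and the `scale`/`tau` parameters are all hypothetical stand-ins for details not given in the abstract.

```python
import math

def coord_confidence(coord_token_logprobs):
    # Mean probability over a prediction's coordinate tokens
    # (one plausible confidence measure; the paper's exact formula may differ).
    return sum(math.exp(lp) for lp in coord_token_logprobs) / len(coord_token_logprobs)

def select_pseudo_label(samples):
    # samples: list of (point, coord_token_logprobs) from multiple rollouts.
    # The most confident sample becomes the pseudo-label.
    return max(samples, key=lambda s: coord_confidence(s[1]))[0]

def crl_reward(pred, pseudo, scale=100.0):
    # CRL-style distance-based reward: closer to the pseudo-label is better.
    d = math.hypot(pred[0] - pseudo[0], pred[1] - pseudo[1])
    return max(0.0, 1.0 - d / scale)

def cnrl_reward(pred, pseudo, tau=5.0, penalty=-1.0):
    # CNRL-style negative-only signal: samples far from the pseudo-label are
    # penalized; near samples receive zero reward, so no potentially
    # corrupted positive signal is used.
    d = math.hypot(pred[0] - pseudo[0], pred[1] - pseudo[1])
    return penalty if d > tau else 0.0
```

In this sketch, CNRL differs from CRL only in the reward: it keeps the penalty on distant samples and discards the positive term, matching the abstract's claim that learning what to avoid can outperform learning from uncertain positives.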
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9012