GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning

ICLR 2026 Conference Submission 19359 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: GUI Visual Grounding, Reinforcement Fine-Tuning
Abstract: Graphical user interface visual grounding (GUI-VG), a core capability for GUI agents, has primarily relied on supervised fine-tuning (SFT) of multimodal large language models (MLLMs), which demands extensive data curation and significant training costs. However, as MLLMs continue to advance and even cover GUI domains during pretraining, the necessity of exhaustive SFT post-training becomes increasingly questionable. Meanwhile, the recent successes of rule-based reinforcement fine-tuning (RFT) suggest a more efficient alternative. Yet despite this promise, the optimal formulation of RFT for GUI-VG remains unexplored. To bridge this gap, we introduce GuirlVG, a reinforcement learning–based GUI-VG method built on a systematic empirical study and a novel stabilization technique. As a preliminary finding, naive application of RFT underperforms the SFT baseline, motivating a deeper exploration of RFT. First, we decompose RFT into its core components and analyze the optimal formulation of each. Second, as part of this exploration, we propose a novel Adversarial KL Factor that dynamically stabilizes training to mitigate reward over-optimization. Third, we further explore RFT training configurations to enhance effectiveness. Extensive experiments show that GuirlVG, with only 5.2K training samples, outperforms SFT methods trained on over 10M samples, achieving a +7.7% improvement on ScreenSpot, a +17.2% improvement on ScreenSpotPro, and 91.9% accuracy on ScreenSpotV2.
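For concreteness, the sketch below illustrates the general rule-based RFT recipe the abstract describes: a verifiable grounding reward (whether the predicted click lands inside the ground-truth element box) and a KL coefficient adjusted online to curb reward over-optimization. The abstract does not specify GuirlVG's exact reward rule or its Adversarial KL Factor, so the function names, the click-in-box rule, and the schedule in adaptive_kl_coef are illustrative assumptions, not the paper's method.

```python
import re

def grounding_reward(response: str, gt_box: tuple) -> float:
    """Rule-based reward: 1.0 if the predicted click point (x, y) parsed from
    the model response lies inside the ground-truth box (x1, y1, x2, y2),
    else 0.0. (Illustrative; the paper's exact reward is not given here.)"""
    match = re.search(r"\(\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)", response)
    if match is None:
        return 0.0  # unparseable outputs earn no reward
    x, y = float(match.group(1)), float(match.group(2))
    x1, y1, x2, y2 = gt_box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0

def adaptive_kl_coef(base_coef: float, mean_reward: float, target: float = 0.9) -> float:
    """Hypothetical dynamic KL schedule: relax the KL penalty while batch
    reward is below a target so exploration is not suppressed early, and
    strengthen it as reward saturates to counteract over-optimization."""
    if mean_reward <= target:
        return base_coef * (mean_reward / target)
    return base_coef * (1.0 + (mean_reward - target) / (1.0 - target))

if __name__ == "__main__":
    r = grounding_reward("I would click at (120, 340).", gt_box=(100, 300, 200, 400))
    print(r)                             # 1.0: the point falls inside the box
    print(adaptive_kl_coef(0.04, 0.95))  # penalty scaled up as reward saturates
```

The asymmetric schedule mirrors the stabilization goal stated in the abstract: a light penalty while the policy is still improving, a heavier one once reward begins to saturate.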
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19359