Spatially Stable GUI Grounding via Zoom Consistency Loss

Published: 27 May 2026, Last Modified: 16 Jun 2026, Venue: CompLearn 2026 Poster, License: CC BY 4.0
Keywords: GUI Grounding, Zoom Consistency Loss
Abstract: GUI grounding, the task of localizing target UI elements from natural language instructions on a screenshot, is a core capability for GUI agents, yet remains challenging due to dense layouts and small elements in high-resolution interfaces. While inference-time zoom methods improve accuracy by re-running inference on cropped regions, they require multiple forward passes per grounding call, making them costly for multi-step agent deployment. Through controlled experiments, we find that models already possess sufficient visual understanding of target elements; what they lack is stable spatial focus under cluttered, high-resolution inputs, a problem we term spatial instability. To address this, we propose a Zoom Consistency Loss, a lightweight auxiliary training objective that enforces agreement between predictions on the original screenshot and on zoomed crops of the same image. At inference time, the model requires only a single forward pass with no additional overhead. Experiments across multiple benchmarks show consistent improvements, with particularly strong gains on the high-resolution ScreenSpot-Pro dataset (+3.80), demonstrating zoom consistency as an effective regularizer for spatially stable grounding.
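
As an illustration of the consistency idea described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the model outputs a click point in normalized coordinates, that the crop prediction is mapped back into full-image coordinates, and that disagreement is penalized with a squared distance. The abstract does not specify the actual loss form, the crop sampling strategy, or the weighting, so the names `crop_to_full`, `zoom_consistency_loss`, `total_loss`, and the `weight` parameter are all hypothetical.

```python
from typing import Tuple

Point = Tuple[float, float]              # (x, y), normalized to [0, 1]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom), normalized to [0, 1]


def crop_to_full(point: Point, crop: Box) -> Point:
    """Map a point given in crop-normalized coordinates back to
    full-screenshot normalized coordinates."""
    left, top, right, bottom = crop
    x, y = point
    return (left + x * (right - left), top + y * (bottom - top))


def zoom_consistency_loss(full_pred: Point, crop_pred: Point, crop: Box) -> float:
    """Assumed form: squared L2 distance between the prediction on the full
    screenshot and the prediction on the zoomed crop, after mapping the
    crop prediction back to full-image coordinates."""
    cx, cy = crop_to_full(crop_pred, crop)
    fx, fy = full_pred
    return (fx - cx) ** 2 + (fy - cy) ** 2


def total_loss(grounding_loss: float, consistency: float, weight: float = 0.1) -> float:
    """Assumed combination: the main grounding loss plus a weighted
    auxiliary consistency term. The weight value is illustrative only."""
    return grounding_loss + weight * consistency


# Example: the crop covers the bottom-right quadrant of the screenshot.
crop = (0.5, 0.5, 1.0, 1.0)
agree = zoom_consistency_loss((0.75, 0.75), (0.5, 0.5), crop)     # both views point to the same spot
disagree = zoom_consistency_loss((0.25, 0.75), (0.5, 0.5), crop)  # full view drifts left
```

In this sketch the auxiliary term is zero when the two views agree and grows as the full-screenshot prediction drifts away from the crop prediction. Because the term is only used during training, inference still needs a single forward pass on the original screenshot, as the abstract states.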
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 45