Spatially Stable GUI Grounding via Zoom Consistency Loss

Moon Ye-Bin; Jiyeon Son; Tae-Hyun Oh

Published: 27 May 2026, Last Modified: 16 Jun 2026, Venue: CompLearn 2026 Poster, License: CC BY 4.0

Keywords: GUI Grounding, Zoom Consistency Loss

Abstract: GUI grounding, the task of localizing target UI elements from natural language instructions on a screenshot, is a core capability for GUI agents, yet remains challenging due to dense layouts and small elements in high-resolution interfaces. While inference-time zoom methods improve accuracy by re-running inference on cropped regions, they require multiple forward passes per grounding call, making them costly for multi-step agent deployment. Through controlled experiments, we find that models already possess sufficient visual understanding of target elements; what they lack is stable spatial focus under cluttered, high-resolution inputs, a problem we term spatial instability. To address this, we propose a Zoom Consistency Loss, a lightweight auxiliary training objective that enforces agreement between predictions on the original screenshot and on zoomed crops of the same image. At inference time, the model requires only a single forward pass with no additional overhead. Experiments across multiple benchmarks show consistent improvements, with particularly strong gains on the high-resolution ScreenSpot-Pro dataset (+3.80), demonstrating zoom consistency as an effective regularizer for spatially stable grounding.
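To make the idea of agreement between full-screenshot and zoomed-crop predictions concrete, here is a minimal sketch. The abstract does not give the exact form of the loss, so everything below is an assumption for illustration: the model is assumed to output a single click point, the crop prediction is assumed to be normalized to [0, 1] within the crop, and an L1 distance normalized by image size stands in for whatever agreement measure the authors actually use. The names `Crop`, `crop_to_full`, and `zoom_consistency_loss` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Crop:
    """Axis-aligned crop window, in full-screenshot pixel coordinates (hypothetical helper)."""
    left: float
    top: float
    width: float
    height: float


def crop_to_full(point_norm, crop):
    """Map a point normalized to [0, 1] in the crop frame back to full-image pixels."""
    x, y = point_norm
    return (crop.left + x * crop.width, crop.top + y * crop.height)


def zoom_consistency_loss(pred_full_px, pred_crop_norm, crop, image_size):
    """Assumed form: L1 disagreement between the two predictions, scaled by image size.

    pred_full_px   -- (x, y) predicted on the original screenshot, in pixels
    pred_crop_norm -- (x, y) predicted on the zoomed crop, normalized to the crop
    crop           -- the crop window that was zoomed into
    image_size     -- (width, height) of the original screenshot in pixels
    """
    width, height = image_size
    crop_x, crop_y = crop_to_full(pred_crop_norm, crop)
    full_x, full_y = pred_full_px
    return abs(full_x - crop_x) / width + abs(full_y - crop_y) / height


if __name__ == "__main__":
    crop = Crop(left=100, top=200, width=400, height=300)
    # The crop-frame centre maps to pixel (300, 350) in the full screenshot.
    print(crop_to_full((0.5, 0.5), crop))
    # The two predictions disagree by 20 px horizontally on a 1000 px wide image.
    print(zoom_consistency_loss((320, 350), (0.5, 0.5), crop, (1000, 1000)))
```

In training, a term like this would presumably be added to the main grounding loss with some weight and computed on differentiable model outputs (for example, tensors in an autodiff framework) rather than plain floats. Since the crop branch is used only during training, this is consistent with the abstract's claim that inference needs a single forward pass.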

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 45
