IGG: A Benchmark for Interactive GUI Grounding under Visibility Constraints

Published: 27 May 2026, Last Modified: 27 May 2026
Venue: CompLearn 2026 Poster
License: CC BY 4.0
Keywords: GUI grounding, benchmark, agents, vlm, mllm, interface, gui
TL;DR: IGG is a benchmark for GUI grounding where agents need to interact with the interface to reveal hidden targets before localization.
Abstract: Existing GUI grounding benchmarks evaluate whether a VLM agent can localize a target element from a single static screenshot, typically assuming that the target is already visible. Real-world interfaces, however, often impose visibility constraints such as off-screen targets, hover-based descriptions, occlusions, and delayed activation, so a correct click first requires an interaction that reveals the target. In these settings, grounding requires not only visual matching but also interaction to recover target visibility. We introduce Interactive GUI Grounding (IGG), a benchmark for grounding under limited observability. In IGG, the target is not directly localizable or actionable from the initial screenshot, and agents must expose the target before localizing it. We define a minimal action space for visibility recovery and a three-level taxonomy of GUI constraints (single-state, multi-state, and temporal and advanced settings) with seven sub-types, enabling systematic evaluation of GUI agents under diverse visibility constraints.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 40