SCREEN-SBERT: EMBEDDING FUNCTIONAL SEMANTICS OF GUI SCREENS TO SUPPORT GUI AGENTS

ICLR 2026 Conference Submission 8953 Authors

17 Sept 2025 (modified: 30 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: screen similarity, functional semantics, GUI agents, contrastive learning
Abstract: Recent GUI agent studies show that augmenting LLM prompts with app-related knowledge constructed during a pre-exploration phase can effectively improve task success rates. However, retrieving relevant knowledge from the knowledge base remains a key challenge. Existing approaches often rely on structured metadata such as view hierarchies, which are frequently unavailable or outdated, limiting their generalizability. Purely vision-based methods have emerged to address this issue, yet they typically compare GUI elements only by visual appearance, leading to mismatches between functionally different elements. We consider a two-stage retrieval framework in which the first stage retrieves screenshots sharing the same functional semantics, followed by fine-grained element-level retrieval. This paper focuses on the first stage and proposes Screen-SBERT, a purely vision-based method for embedding the functional semantics of GUI screenshots and retrieving functionally equivalent ones within the same mobile app. Experimental results on real-world mobile apps show that Screen-SBERT is more effective than several baselines at retrieving functionally equivalent screenshots. In summary, our contributions are: (1) we formally define the concepts of functional equivalence and functional page class; (2) we design a contrastive learning-based embedding framework; and (3) we conduct ablation studies that provide insights for future model design.
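The abstract does not specify the encoder architecture or training objective, so the following is only a minimal sketch of the general idea it describes: a vision encoder trained with an InfoNCE-style contrastive loss in which screenshots from the same functional page class form positive pairs, and retrieval over the resulting embeddings is done by cosine similarity. All names here (ScreenEncoder, info_nce_loss, retrieve) and the toy CNN backbone are hypothetical, not the authors' model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScreenEncoder(nn.Module):
    """Maps a screenshot tensor (3, H, W) to an L2-normalized embedding.
    The backbone is a placeholder CNN; the paper's actual encoder is unspecified here."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj(self.backbone(x))
        return F.normalize(z, dim=-1)


def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE contrastive loss: each anchor's positive is a screenshot from the
    same functional page class; the other items in the batch act as negatives."""
    logits = anchor @ positive.t() / temperature           # (B, B) cosine similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)


@torch.no_grad()
def retrieve(query: torch.Tensor, index: torch.Tensor, top_k: int = 5) -> torch.Tensor:
    """Return indices of the top-k most similar screenshots in the knowledge base."""
    sims = query @ index.t()                                # embeddings are already normalized
    return sims.topk(top_k, dim=-1).indices


if __name__ == "__main__":
    encoder = ScreenEncoder()
    # Toy batch: pairs of screenshots assumed to share a functional page class.
    anchors = torch.randn(8, 3, 224, 224)
    positives = torch.randn(8, 3, 224, 224)
    loss = info_nce_loss(encoder(anchors), encoder(positives))
    loss.backward()
    print("contrastive loss:", loss.item())
```

Under this reading, the first retrieval stage reduces to encoding a query screenshot and the pre-explored knowledge base with the same encoder, then ranking stored screenshots by cosine similarity before the fine-grained element-level stage.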
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8953