Keywords: screen similarity, functional semantics, GUI agents, contrastive learning
Abstract: Recent GUI agent studies show that augmenting LLM prompts with app-related
knowledge constructed during a pre-exploration phase can effectively improve
task success rates. However, retrieving relevant knowledge from the knowledge
base remains a key challenge. Existing approaches often rely on structured
metadata such as view hierarchies, which are frequently unavailable or outdated,
thereby limiting their generalizability. Purely vision-based methods have emerged
to address this issue, yet they typically compare GUI elements only by visual appearance,
leading to mismatches between functionally different elements. We consider
a two-stage retrieval framework, where the first stage retrieves screenshots
sharing the same functional semantics, followed by fine-grained element-level retrieval.
This paper focuses on the first stage by proposing Screen-SBERT, a purely
vision-based method for embedding the functional semantics of GUI screenshots
and retrieving functionally equivalent ones within the same mobile app. Experimental
results on real-world mobile apps show that Screen-SBERT is more effective
than several baselines for retrieving functionally equivalent screenshots. Our
contributions are threefold: (1) we formally define the concepts of functional equivalence
and functional page class; (2) we design a contrastive learning-based embedding framework;
and (3) we conduct ablation studies that provide insights for future model design.
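The abstract mentions a contrastive learning-based embedding framework for GUI screenshots. Below is a minimal, self-contained sketch of one common instantiation of that idea: an NT-Xent (contrastive) objective over pairs of screenshots from the same functional page class, using a generic vision backbone. The class and function names (`ScreenEncoder`, `nt_xent_loss`), the ResNet-18 backbone, and all hyperparameters are illustrative assumptions and do not reflect the actual Screen-SBERT architecture.

```python
# Illustrative sketch only: a generic contrastive objective over screenshot
# embeddings. Names, backbone, and data here are assumptions, not the paper's
# actual Screen-SBERT design.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class ScreenEncoder(nn.Module):
    """Embed a GUI screenshot into a fixed-size, L2-normalized vector."""

    def __init__(self, dim: int = 128):
        super().__init__()
        backbone = models.resnet18(weights=None)   # any vision backbone works
        backbone.fc = nn.Identity()                # drop the classification head
        self.backbone = backbone
        self.proj = nn.Linear(512, dim)            # project to embedding space

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(self.backbone(x)), dim=-1)


def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """NT-Xent contrastive loss: z1[i] and z2[i] are embeddings of two
    functionally equivalent screenshots; all other pairs act as negatives."""
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)                 # (2n, dim)
    sim = z @ z.t() / tau                          # cosine similarities (unit norm)
    mask = torch.eye(2 * n, dtype=torch.bool)      # exclude self-similarity
    sim = sim.masked_fill(mask, float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)


if __name__ == "__main__":
    encoder = ScreenEncoder()
    # Two views: screenshots from the same functional page class (synthetic here).
    view_a = torch.randn(8, 3, 224, 224)
    view_b = torch.randn(8, 3, 224, 224)
    loss = nt_xent_loss(encoder(view_a), encoder(view_b))
    loss.backward()
    print(f"contrastive loss: {loss.item():.4f}")
```

At retrieval time, an encoder trained this way would embed a query screenshot and return the nearest stored screenshots by cosine similarity; how Screen-SBERT actually forms positives and negatives is specified in the paper, not here.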
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8953