Keywords: screen similarity, functional semantics, GUI agents, contrastive learning
Abstract: Recent GUI agent studies show that augmenting LLM prompts with app-related
knowledge constructed during a pre-exploration phase can effectively improve
task success rates. However, retrieving relevant knowledge from the knowledge
base remains a key challenge. Existing approaches often rely on structured
metadata such as view hierarchies, which are frequently unavailable or outdated,
thereby limiting their generalizability. Purely vision-based methods have emerged
to address this issue, yet they typically compare GUI elements only by visual appearance,
leading to mismatches between functionally different elements. We consider
a two-stage retrieval framework, where the first stage retrieves screenshots
sharing the same functional semantics, followed by fine-grained element-level retrieval.
This paper focuses on the first stage by proposing Screen-SBERT, a purely
vision-based method for embedding the functional semantics of GUI screenshots
and retrieving functionally equivalent ones within the same mobile app. Experimental
results on real-world mobile apps show that Screen-SBERT is more effective
than several baselines for retrieving functionally equivalent screenshots. Our
contributions are threefold: (1) we formally define the concepts of functional equivalence
and functional page class; (2) we design a contrastive learning-based embedding framework;
and (3) we conduct ablation studies that provide insights for future model design.
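The abstract mentions a contrastive learning-based embedding framework for GUI screenshots. Below is a minimal, self-contained sketch of one common instantiation of that idea: an NT-Xent (contrastive) objective over pairs of screenshots from the same functional page class, using a generic vision backbone. The class and function names (`ScreenEncoder`, `nt_xent_loss`), the ResNet-18 backbone, and all hyperparameters are illustrative assumptions and do not reflect the actual Screen-SBERT architecture.

```python
# Illustrative sketch only: a generic contrastive objective over screenshot
# embeddings. Names, backbone, and data here are assumptions, not the paper's
# actual Screen-SBERT design.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class ScreenEncoder(nn.Module):
    """Embed a GUI screenshot into a fixed-size, L2-normalized vector."""

    def __init__(self, dim: int = 128):
        super().__init__()
        backbone = models.resnet18(weights=None)   # any vision backbone works
        backbone.fc = nn.Identity()                # drop the classification head
        self.backbone = backbone
        self.proj = nn.Linear(512, dim)            # project to embedding space

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(self.backbone(x)), dim=-1)


def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """NT-Xent contrastive loss: z1[i] and z2[i] are embeddings of two
    functionally equivalent screenshots; all other pairs act as negatives."""
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)                 # (2n, dim)
    sim = z @ z.t() / tau                          # cosine similarities (unit norm)
    mask = torch.eye(2 * n, dtype=torch.bool)      # exclude self-similarity
    sim = sim.masked_fill(mask, float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)


if __name__ == "__main__":
    encoder = ScreenEncoder()
    # Two views: screenshots from the same functional page class (synthetic here).
    view_a = torch.randn(8, 3, 224, 224)
    view_b = torch.randn(8, 3, 224, 224)
    loss = nt_xent_loss(encoder(view_a), encoder(view_b))
    loss.backward()
    print(f"contrastive loss: {loss.item():.4f}")
```

At retrieval time, an encoder trained this way would embed a query screenshot and return the nearest stored screenshots by cosine similarity; how Screen-SBERT actually forms positives and negatives is specified in the paper, not here.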
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8953