Abstract: Navigating mobile applications to accomplish specific tasks can be time-consuming and challenging, particularly for users who are unfamiliar with an app or faced with intricate menu structures. Simplifying access to a particular screen is a shared user priority, especially for individuals with diverse needs, including specific accessibility requirements, which motivates the exploration of solutions that streamline navigation. This work addresses the challenge of mapping natural language intents to user interfaces in mobile applications, with the goal of providing users a more intuitive and efficient way to reach desired screens by expressing their intentions in natural language. Existing approaches to this task have relied heavily on qualitative human studies for evaluation, and widely used pre-trained vision-language models, such as Contrastive Language-Image Pretraining (CLIP), struggle to generalize to the distinctive visual characteristics of user interfaces. Acknowledging these limitations, we introduce an approach that harnesses pre-trained vision-language models and investigate whether fine-tuning them on mobile screens can address the challenges posed by the intricate nature of mobile application interfaces. Our approach uses state-of-the-art pre-trained text and image encoders together with a supervised fine-tuning process that adapts the pre-trained models to the specific needs of mobile screen interactions. A shared embedding space aligns the embeddings of the text and image modalities, fostering a cohesive correspondence between natural language intents and the visual features of user interface elements. We conduct extensive experimental evaluations on the Screen2Words dataset. Through systematic analysis and established metrics, we examine the models' ability to accurately map diverse linguistic intents to specific user interfaces. Our analysis demonstrates that fine-tuning yields substantial improvements over the zero-shot performance of the pre-trained vision-language models.
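To make the described approach concrete, the sketch below shows one plausible form of such supervised fine-tuning, using the Hugging Face CLIP implementation to align screen images and intent texts in a shared embedding space via CLIP's contrastive loss. The dataset object `screen_intent_pairs`, the checkpoint choice, and all hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: contrastive fine-tuning of a pre-trained CLIP model on
# (screen image, natural-language intent) pairs. Names and hyperparameters
# are assumptions for illustration only.
import torch
from torch.utils.data import DataLoader
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def collate(batch):
    # Each item is assumed to be a (PIL screen image, intent string) pair.
    images, intents = zip(*batch)
    return processor(text=list(intents), images=list(images),
                     return_tensors="pt", padding=True, truncation=True)

# `screen_intent_pairs` is a hypothetical dataset of screen/intent pairs.
loader = DataLoader(screen_intent_pairs, batch_size=32, shuffle=True,
                    collate_fn=collate)

model.train()
for batch in loader:
    # return_loss=True computes CLIP's symmetric contrastive loss, pulling each
    # intent embedding toward its paired screen embedding in the shared space
    # and pushing it away from the other screens in the batch.
    outputs = model(**batch, return_loss=True)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

At inference time, the same encoders can embed a user's intent and a set of candidate screens, and the screen with the highest cosine similarity to the intent embedding is retrieved.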