ShowUI: One Vision-Language-Action Model for Generalist GUI Agent

Published: 22 Oct 2024 · Last Modified: 31 Oct 2024 · NeurIPS 2024 Workshop Open-World Agents (Oral) · CC BY 4.0
Keywords: Graphical User Interface; Language Agent; Vision-Language-Action Models; Computer Usage; Human Workflow Automation
TL;DR: An open-source recipe to train Vision-Language-Action models as generalist GUI agents.
Abstract: Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with digital tasks. While recent Large Language Models (LLMs) and Large Multimodal Models (LMMs) have been used to build autonomous agents capable of solving complex tasks, they often rely on closed-source, API-based solutions and exhibit limitations in GUI-specific interactions. Inspired by the success of Vision-Language-Action (VLA) models in embodied environments, we explore their potential in the digital GUI world. In this work, we develop a recipe for training a VLA-based GUI agent, ShowUI, a 4.2B-parameter model built on Phi-3.5-vision-instruct. By leveraging scalable GUI visual data (e.g., screenshots paired with action trajectories), we aim to build a generalist GUI agent that demonstrates capabilities across diverse dimensions: grounding, navigation, and understanding. ShowUI supports various platforms, including websites, desktops, and mobile phones, and accommodates diverse visual inputs such as single-frame images, multiple frames, and videos. We show that ShowUI achieves strong results across multiple benchmarks, including Screenspot, Mind2Web, AITW, AITZ, GUI-Odyssey, and GUI-World. We provide extensive experiments analyzing how different types of training corpora and model design decisions affect downstream tasks. The model, code, and data will be open-sourced at https://github.com/showlab/ShowUI.
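
To make the grounding capability concrete, below is a minimal sketch of how one might query a Phi-3.5-vision-based model for GUI element grounding using the Hugging Face transformers library. The prompt wording, coordinate output format, and use of the base checkpoint are illustrative assumptions, not the released ShowUI interface; a fine-tuned ShowUI checkpoint would replace MODEL_ID once available.

```python
# Minimal sketch (assumptions): querying a Phi-3.5-vision-style model for GUI grounding.
# The prompt wording and coordinate format are illustrative, not the released ShowUI recipe.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Phi-3.5-vision-instruct"  # base model ShowUI builds on

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    _attn_implementation="eager",  # avoid the flash-attention requirement
)

# GUI grounding query: screenshot + instruction -> location of the target element.
screenshot = Image.open("screenshot.png")
prompt = (
    "<|user|>\n<|image_1|>\n"
    "Locate the element described by: 'Search button'. "
    "Answer with normalized (x, y) click coordinates.<|end|>\n<|assistant|>\n"
)

inputs = processor(prompt, [screenshot], return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    eos_token_id=processor.tokenizer.eos_token_id,
)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # e.g. "(0.82, 0.14)"; the exact format depends on the training corpus
```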
Submission Number: 56