TL;DR: We introduce UI-Vision, a comprehensive, desktop-centric, and license-permissive GUI understanding benchmark.
Abstract: Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer-use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse-grained tasks—Element Grounding, Layout Grounding, and Action Prediction—with well-defined metrics to rigorously evaluate agents' performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer-use agents. With UI-Vision, we aim to advance the development of more capable agents for real-world desktop tasks.
Lay Summary: Desktop graphical user interfaces (GUIs)—like those used for software applications—are central to how we perform daily tasks such as editing documents or managing files. Yet, automating these desktop tasks with artificial intelligence remains difficult, mainly due to challenges in understanding the complex visual information and interactions that users regularly navigate. To address this, we developed UI-Vision, a large-scale dataset that captures detailed interactions with 83 popular desktop software applications. It includes thousands of carefully annotated examples showing how humans interact with these interfaces, such as clicking, dragging, and typing. Our dataset provides benchmarks to assess how well AI models understand and interact with desktop GUIs. Evaluations using UI-Vision reveal significant limitations in existing state-of-the-art AI models, particularly when tasks require understanding professional software tools or performing complex actions like dragging and dropping. By clearly identifying these challenges, UI-Vision helps guide future improvements in AI systems designed to automate and enhance everyday computer use.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/uivision/UI-Vision
Primary Area: Applications->Everything Else
Keywords: Large language models, Multimodal models, LLMs, VLMs, Autonomous agents, GUI
Flagged For Ethics Review: true
Submission Number: 13703