Keywords: UI understanding, GUI agent, Multi-modal
TL;DR: A comprehensive study on building light-weight, on-device GUI agents
Abstract: Developing autonomous agents that effectively interact with Graphical User Interfaces (GUIs) remains a challenging open problem, especially for small on-device models. In this paper, we present FERRET-UI LITE, a compact, end-to-end GUI agent that operates across diverse platforms, including mobile, web, and desktop. Using techniques optimized for developing small models, we build our 3B FERRET-UI LITE agent by curating a diverse GUI data mixture from real and synthetic sources, strengthening inference-time performance through chain-of-thought reasoning and visual tool use, and applying reinforcement learning with designed rewards. FERRET-UI LITE achieves performance competitive with other small-scale GUI agents. In GUI grounding, FERRET-UI LITE attains scores of 91.6%, 53.3%, and 61.2% on the ScreenSpot-V2, ScreenSpot-Pro, and OSWorld-G benchmarks, respectively. In GUI navigation, FERRET-UI LITE achieves success rates of 28.0% on AndroidWorld and 19.8% on OSWorld. We share our methods and lessons learned from developing compact, on-device GUI agents.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 20050