TinyClick: Single-Turn Agent for Empowering GUI Automation

Published: 01 Jan 2024 · Last Modified: 02 Sept 2025 · CoRR 2024 · CC BY-SA 4.0
Abstract: We present a UI agent for user interface (UI) interaction tasks, built on the Vision-Language Model Florence-2-Base. The agent's primary task is identifying the screen coordinates of the UI element corresponding to the user's command. It demonstrates very strong performance on Screenspot and OmniAct annotations while maintaining a very small size of 0.27B parameters and minimal latency. Moreover, training requires a small compute budget of 56 GPU-hours (roughly 40 USD). The main improvements come from vision-specific multi-task training and MLLM-based data augmentation. We hope that the reduced need for expensive compute resources and manually annotated data will facilitate more inclusive and sustainable research on UI agents.
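The single-turn interface described above is one forward pass: given a screenshot and a natural-language command, the model emits text containing location tokens that decode to a click point. Below is a minimal sketch of that loop using the public Florence-2-Base checkpoint via Hugging Face transformers; the checkpoint name, the prompt template, and the <loc_N> decoding are assumptions based on Florence-2's conventions, not the authors' released code.

```python
# Sketch only: querying a Florence-2-style single-turn click agent.
# (screenshot, command) in -> (x, y) click coordinates out.
import re

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed base checkpoint; the paper fine-tunes Florence-2-Base for this task.
MODEL_ID = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)


def locate(image: Image.Image, command: str) -> tuple[int, int] | None:
    """Return a predicted (x, y) click point for a natural-language command."""
    # Hypothetical prompt template; the paper's exact format may differ.
    prompt = f"What to do to execute the command? {command}"
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    with torch.no_grad():
        ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=64,
            do_sample=False,
        )
    text = processor.batch_decode(ids, skip_special_tokens=False)[0]
    # Florence-2 encodes positions as <loc_N> tokens, N in 0..999
    # (normalized coordinate bins); map the first pair back to pixels.
    locs = [int(n) for n in re.findall(r"<loc_(\d+)>", text)]
    if len(locs) < 2:
        return None
    x = round(locs[0] / 999 * (image.width - 1))
    y = round(locs[1] / 999 * (image.height - 1))
    return x, y


print(locate(Image.open("screenshot.png"), "click on the settings icon"))
```

A runtime like this is consistent with the abstract's latency claim: a 0.27B-parameter model needs only a single short generation per command, with no multi-turn planning loop.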