Abstract: We build a comprehensive online evaluation benchmark for language-conditioned, multi-step task execution on mobile interfaces. Our benchmark strives to evaluate the multi-step planning, reasoning, and visual grounding capabilities of agents, using mobile user interfaces as a concrete testbed. To build diverse, challenging tasks that reflect real-world use cases, we propose an exhaustive taxonomy that allows us to measure progress along multiple decision-making abilities, including multi-step planning, visual perception, action grounding, and the use of memory or external knowledge. We also highlight important factors, such as statefulness, safety, and evaluation complexity, that are key to designing tasks that can be reliably evaluated. Using this taxonomy, we design 116 tasks across 36 unique apps. Through an automatic framework, we stage and evaluate several natural baselines with different input representations and planning strategies. We show that the best-performing agent achieves 40% success on our benchmark. We further measure agents' abilities to plan, ground, and utilize world knowledge, highlighting areas for improvement.