Abstract: We build a comprehensive online evaluation benchmark for language-conditioned, multi-step task execution on mobile interfaces. Our benchmark strives to evaluate the multi-step planning, reasoning, and visual grounding capabilities of agents, using mobile user interfaces as a concrete testbed. To build diverse, challenging tasks that reflect real-world use cases, we propose an exhaustive taxonomy that allows us to measure progress along multiple decision-making abilities, including multi-step planning, visual perception, action grounding, and the use of memory or external knowledge. We also highlight important factors, such as statefulness, safety, and evaluation complexity, that are key to designing tasks that can be reliably evaluated. Using this taxonomy, we design 116 tasks across 36 unique apps. Through an automatic framework, we stage and evaluate several natural baselines with different input representations and planning strategies. We show that the best-performing agent achieves 40% success on our benchmark. We further measure agents' abilities to plan, ground, and utilize world knowledge, highlighting areas for improvement.