iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Published: 23 May 2026, Last Modified: 23 May 2026
Venue: ICML 2026 AIWILD
License: CC BY 4.0
Keywords: LLM Agents, Phone Agents, Mobile Agents
Abstract: A useful phone agent will have to be personally intelligent. It must reason over the user’s identity, history, and preferences as they exist on their device, not just instructions in an impersonal sandbox. Existing phone-agent benchmarks evaluate the latter and largely ignore the former. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity that spans 26 newly built iOS apps. These apps contain interconnected data including transaction histories, messaging threads, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three categories of increasing difficulty: single-app tasks (27), multi-app tasks (60), and memory and personalization tasks (46). We evaluate leading frontier and open-source computer-use models under both vision-only and privileged vision+XML settings. The best configuration reaches 51% overall but only 36% on multi-app tasks. Privileged vision+XML access improves the stronger frontier models by up to 26%, while smaller models do not benefit from the added accessibility-tree input. We release iOSWorld as an open-source benchmark, including all apps, seed data, tasks, rubrics, and evaluation code, to support reproducible research on personally intelligent phone agents.
Track: Regular Paper (9 pages)
Submission Number: 126