Keywords: UI recognition, Cognitive modeling, AI agents, Generative code, Grounding
Abstract: Grounding is central to AI agents on smartphones and requires recognizing relevant UI elements on graphical interfaces. However, existing grounding methods typically prioritize either efficiency or accuracy, and struggle to balance both under real-world UI variations. To address this challenge, we adopt a dual-system approach: System 1 efficiently recognizes UI elements using predefined rules, while System 2 provides deeper analytical reasoning when System 1 fails. To bridge the two systems, we propose GroundCoder, a multi-agent system that extracts representative UI features (e.g., visual appearance and layout) based on System 2’s reasoning and generates executable code for System 1. The generated code transfers System 2’s analytical capabilities to System 1 by encoding them as executable rules, enabling fast and efficient recognition of UI elements beyond predefined patterns. To systematically evaluate our approach, we construct Eleva, a dataset of UI elements collected from popular mobile applications, covering diverse devices, display modes, and application modes. Experiments on Eleva show that our method preserves efficiency comparable to rule-based methods while improving recognition accuracy by 34.6% over existing mainstream methods. We further discuss implications for using generative code in UI recognition to support more robust grounding in dynamic mobile environments.
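The dual-system loop described in the abstract — fast rule execution in System 1, with System 2 generating new executable rules on failure — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; all names (`UIElement`, `system1`, `system2`, the rule registry) are hypothetical, and the System 2 stub stands in for the analytical reasoning and code generation performed by GroundCoder's agents.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class UIElement:
    text: str
    bounds: tuple  # (left, top, right, bottom) screen coordinates

# System 1: a registry of fast, executable recognition rules.
Rule = Callable[[UIElement], Optional[str]]
rules: list[Rule] = []

def system1(elem: UIElement) -> Optional[str]:
    """Try each predefined rule; return the first label that matches."""
    for rule in rules:
        label = rule(elem)
        if label is not None:
            return label
    return None  # recognition failed -> fall back to System 2

def system2(elem: UIElement) -> tuple[str, Rule]:
    """Stand-in for slow analytical reasoning (e.g., an LLM-based agent):
    returns a label plus a new executable rule encoding that analysis."""
    label = "button" if "OK" in elem.text else "label"
    new_rule: Rule = lambda e, lab=label, txt=elem.text: lab if e.text == txt else None
    return label, new_rule

def recognize(elem: UIElement) -> str:
    label = system1(elem)
    if label is None:
        label, new_rule = system2(elem)
        rules.append(new_rule)  # transfer System 2's analysis to System 1
    return label
```

On the first encounter with an unseen element, `recognize` falls through to System 2 and registers the generated rule; subsequent identical elements are then handled by System 1 alone, which is the efficiency transfer the abstract describes.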
Paper Type: Long
Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics
Research Area Keywords: Human-Computer Interaction, cross-modal information extraction, AI agents
Contribution Types: Approaches for low-compute settings, efficiency; Data resources
Languages Studied: English, Chinese
Submission Number: 3539