CHOP: Mobile Operating Assistant with Constrained High-frequency Optimized Subtask Planning

ACL ARR 2025 February Submission 4083 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: The advancement of visual language models (VLMs) has enhanced mobile device operation, allowing simulated human-like actions to address user requirements. Current VLM-based mobile operating assistants can be structured into three levels: task, subtask, and action. The subtask level, which links high-level goals with low-level executable actions, is crucial for task completion but faces two challenges: ineffective subtasks that the lower-level agent cannot execute, and inefficient subtasks that fail to contribute to the completion of the higher-level task. These challenges stem from the VLM's lack of experience in decomposing subtasks within GUI scenarios in a multi-agent architecture. To address them, we propose a new mobile assistant architecture with constrained high-frequency optimized planning (CHOP). Our approach overcomes the VLM's deficiency in planning for GUI scenarios by using human-planned subtasks as the "basis vector". We evaluate our architecture in both English and Chinese contexts across 20 apps, demonstrating significant improvements in both effectiveness and efficiency. Our dataset and code are available at https://anonymous.4open.science/r/CHOP-667F
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, NLP Applications
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English, Chinese
Submission Number: 4083