Keywords: Multimodal agent, complex task automation, refined perception, reasoning and planning
Abstract: MLLM-based GUI agents can automatically assist humans in completing various tasks on smart devices, demonstrating significant potential and application value. Unlike smartphones, the PC scenario not only features a more complex interactive environment with denser and more varied UI and text layouts, but also involves more intricate intra- and inter-app workflows, thus posing greater challenges for both perception and decision-making. To address these issues, we propose a hierarchical agentic framework named PC-Agent. From the perception perspective, we devise an Active Perception Module (APM) to compensate for the limited ability of current MLLMs to perceive screenshot content. The APM integrates intention understanding and OCR to achieve fine-grained perception of the content and location of target text, and utilizes the accessibility (A11y) tree to obtain interactive element information. From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture that decomposes decision-making into Instruction-Subtask-Action levels. Within this architecture, three agents (i.e., Manager, Progress, and Decision) are responsible for instruction decomposition, progress tracking, and step-by-step decision-making, respectively. Additionally, a Reflection agent provides timely bottom-up error feedback and adjustment. Alongside the PC-Agent framework, we introduce a new benchmark, PC-Eval, covering 8 widely used applications and 25 real-world complex instructions. Empirical results on PC-Eval show that PC-Agent achieves a 32% absolute improvement in task success rate over previous state-of-the-art methods. The code is available at https://github.com/X-PLUG/MobileAgent/tree/main/PC-Agent.
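To make the Instruction-Subtask-Action hierarchy described above more concrete, the following is a minimal structural sketch of how the Manager, Progress, Decision, and Reflection agents could collaborate. All class and method names here are illustrative assumptions rather than the authors' actual API; the real implementation (including the MLLM calls and the APM's OCR and A11y-tree perception) is in the linked repository.

```python
# Illustrative sketch only: names and logic are assumptions, not the paper's implementation.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Subtask:
    description: str
    done: bool = False


class ManagerAgent:
    """Decomposes a complex user instruction into an ordered list of subtasks."""
    def decompose(self, instruction: str) -> List[Subtask]:
        # Placeholder: a real Manager agent would query an MLLM here.
        return [Subtask(part.strip()) for part in instruction.split(";") if part.strip()]


class ProgressAgent:
    """Tracks which subtasks are finished and selects the next one to execute."""
    def next_subtask(self, subtasks: List[Subtask]) -> Optional[Subtask]:
        return next((s for s in subtasks if not s.done), None)


class DecisionAgent:
    """Produces step-by-step GUI actions for the current subtask."""
    def act(self, subtask: Subtask, screen_state: str) -> str:
        # Placeholder: a real Decision agent would combine the screenshot with
        # the APM's fine-grained perception to choose a concrete GUI action.
        return f"click element relevant to '{subtask.description}' on {screen_state}"


class ReflectionAgent:
    """Checks the outcome of an action and signals whether to retry (bottom-up feedback)."""
    def verify(self, action: str, screen_state: str) -> bool:
        return True  # Placeholder: assume the action succeeded.


def run(instruction: str) -> None:
    manager, progress, decision, reflection = (
        ManagerAgent(), ProgressAgent(), DecisionAgent(), ReflectionAgent()
    )
    subtasks = manager.decompose(instruction)
    while (current := progress.next_subtask(subtasks)) is not None:
        action = decision.act(current, screen_state="desktop")
        print(action)
        if reflection.verify(action, screen_state="desktop"):
            current.done = True


if __name__ == "__main__":
    run("open the browser; search for flight prices; record the results in a spreadsheet")
```

In this sketch the Manager operates at the instruction level, the Progress and Decision agents at the subtask and action levels, and the Reflection agent feeds verification results back before a subtask is marked complete, mirroring the top-down decomposition and bottom-up feedback described in the abstract.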
Submission Number: 160