WebPlanner: Task Planning with Autonomous Experience Exploration and Utilization for Real World Multimodal Web Agents
Keywords: Agent, Task Planning
Abstract: Multimodal web agents can assist humans in operating unfamiliar websites and handling repetitive GUI tasks, where effective task planning is essential for decomposing complex tasks into executable actions. While small open‑source multimodal large language models (MLLMs) offer a cost‑efficient alternative to commercial models, they suffer from weak planning ability and limited generalization, especially in cross‑website scenarios. To address this, we propose the task decomposition hierarchical analysis framework (TDHAF) to systematically study compositional generalization across three task granularities (low, middle, and high level) and two generalization types (in‑domain and out‑of‑domain, OOD). Our analysis reveals that mastering low‑level atomic skills does not guarantee high‑level planning competence, while training on high‑level tasks yields stronger OOD generalization. Motivated by these findings, we introduce the planning experience exploration and utilization (PEEU) method, which enables agents to autonomously set goals, explore unfamiliar environments, and synthesize well‑aligned high‑level task trajectories from extracted experiences. In real‑world multimodal online web navigation, where agents train on one website and are evaluated on 12 unseen websites, PEEU consistently outperforms baselines across model scales (3B, 7B) and training paradigms (SFT, GRPO), reaching 14.9% accuracy on the GRPO 7B model, compared to 7.2% and 10.1% for the atomic and basic methods, respectively. These results demonstrate that constructing high‑level tasks and leveraging experiences are crucial to the OOD planning ability of small MLLMs.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10545