Keywords: large language model, multi-agent, reinforcement learning
Abstract: Large Language Models (LLMs) excel at short-horizon tasks but struggle in complex, long-horizon scenarios involving multi-turn interactions, multi-step reasoning, and selective multi-modal perception. Two core challenges in these settings are effective long-term planning and mitigating cross-modal distraction. Our empirical analysis shows that a single LLM agent exhibits steep performance drops as interaction steps increase, underscoring the limitations of monolithic approaches. To overcome these challenges, we propose $\textbf{DEPART}$, a hierarchical multi-agent framework that decomposes planning, action execution, and visual understanding into specialized agents. Through its $\textbf{D}$ivide, $\textbf{E}$valuate, $\textbf{P}$lan, $\textbf{A}$ct, $\textbf{R}$eflect, and $\textbf{T}$rack cycle, DEPART supports dynamic task decomposition, feedback-driven adaptation, and selective vision grounding to reduce cost and improve robustness. Building on this architecture, we introduce Hierarchical Interactive Multi-turn Policy Optimization (HIMPO), a two-round post-training strategy that alternately optimizes the planner and the executor with dense role-specific and sparse task-level rewards to encourage specialization and coordinated long-horizon reasoning. Across the WebArena-Lite and VisualWebArena benchmarks, DEPART with HIMPO consistently outperforms strong single-agent and post-trained baselines.
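As a rough illustration of the Divide, Evaluate, Plan, Act, Reflect, Track cycle described in the abstract, the sketch below mocks the planner, executor, and vision agents as placeholder callables. All names (e.g., `PlannerAgent`, `ExecutorAgent`, `run_depart_episode`) and the stubbed environment feedback are hypothetical and not taken from the submission; this is a minimal sketch of the control flow, not the authors' implementation.

```python
# Minimal, hypothetical sketch of a DEPART-style control loop.
# Agent classes, method names, and feedback signals are illustrative stand-ins.
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Subtask:
    goal: str
    done: bool = False


class PlannerAgent:
    """Divides the task, evaluates progress, and (re)plans subtasks."""

    def divide(self, task: str) -> list[Subtask]:
        # Placeholder decomposition: split the instruction into steps.
        return [Subtask(goal=part.strip()) for part in task.split(";") if part.strip()]

    def replan(self, subtasks: list[Subtask], feedback: str) -> list[Subtask]:
        # A real planner would revise the plan from feedback; here we
        # simply keep the remaining (unfinished) subtasks.
        return [s for s in subtasks if not s.done]


class ExecutorAgent:
    """Acts on the current subtask, optionally using visual context."""

    def act(self, subtask: Subtask, observation: str) -> str:
        return f"web action for: {subtask.goal} (context: {observation})"


class VisionAgent:
    """Grounds actions in the page only when visual context is requested."""

    def describe(self, screenshot: bytes | None) -> str:
        return "no screenshot requested" if screenshot is None else "salient elements: ..."


def run_depart_episode(task: str, max_steps: int = 8) -> list[str]:
    planner, executor, vision = PlannerAgent(), ExecutorAgent(), VisionAgent()
    trajectory: list[str] = []

    subtasks = planner.divide(task)                       # Divide
    for _ in range(max_steps):
        pending = [s for s in subtasks if not s.done]
        if not pending:                                   # Evaluate: task complete
            break
        current = pending[0]                              # Plan: pick the next subtask
        observation = vision.describe(screenshot=None)    # selective vision grounding
        action = executor.act(current, observation)       # Act
        trajectory.append(action)                         # Track
        success = True                                    # stubbed environment feedback
        if success:
            current.done = True
        else:                                             # Reflect: replan on failure
            subtasks = planner.replan(subtasks, feedback="step failed")
    return trajectory


if __name__ == "__main__":
    print(run_depart_episode("open settings; change language; save"))
```

Under this reading, the HIMPO post-training stage would alternate updates to the planner and executor policies, each with its own dense role-specific reward plus a shared sparse task-level reward; the sketch above only covers the inference-time cycle.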
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20133