CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning
Keywords: LVLMs, agent, computer use agent
TL;DR: We introduce CODA, a framework that uses decoupled reinforcement learning to train specialist planners by separating high-level planning from low-level execution, then merges their knowledge to create a powerful generalist agent.
Abstract: Autonomous agents for Graphical User Interfaces (GUIs) face significant challenges in novel software: they require both long-horizon planning grounded in software domain knowledge and precise, fine-grained execution. Existing approaches suffer from a trade-off: generalist agents excel at planning but falter in execution, while specialized agents exhibit the opposite weakness. Recent compositional frameworks attempt to bridge this gap by combining a "planner" and an "actor", but they are typically static and non-trainable, preventing adaptation from experience, a critical limitation given the scarcity of high-quality data for novel software.
To address these limitations, we introduce CODA, a novel and trainable compositional framework that synergizes a generalist planner (Cerebrum) with a specialist executor (Cerebellum), trained via a dedicated two-stage pipeline. The first stage, Specialization, employs a decoupled GRPO approach to train an expert planner for each novel software individually. The second stage, Generalization, aggregates the successful trajectories from all specialized experts; this consolidated, high-quality dataset is then used for supervised fine-tuning (SFT) of the final planner, equipping it with robust, cross-domain generalist capabilities.
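The two-stage pipeline can be sketched as follows. This is a minimal illustrative simulation, not the authors' implementation: all function names, data structures, and the toy rollouts are hypothetical, and the actual GRPO and SFT steps are reduced to placeholders.

```python
# Hypothetical sketch of CODA's two-stage training pipeline.
# Stage 1 (Specialization): a decoupled-GRPO-style loop trains one expert
# planner per target software while the executor (Cerebellum) stays fixed.
# Stage 2 (Generalization): successful trajectories from every specialist
# are pooled into one dataset used to SFT the final generalist planner.

def train_specialist(base_planner, software):
    """Stage 1 placeholder: specialize the planner on one software."""
    planner = dict(base_planner, domain=software)
    # Toy rollouts standing in for GRPO sampling; real training would
    # score groups of rollouts and update only the planner's policy.
    trajectories = [
        {"software": software, "steps": ["plan", "act"], "success": s}
        for s in (True, False, True)
    ]
    return planner, trajectories

def generalize(base_planner, software_list):
    """Stage 2 placeholder: pool positive trajectories, then 'SFT'."""
    pooled = []
    for sw in software_list:
        _, trajs = train_specialist(base_planner, sw)
        pooled += [t for t in trajs if t["success"]]  # keep successes only
    # Real pipeline: supervised fine-tuning of the planner on `pooled`.
    return {"planner": "generalist", "sft_data": pooled}

agent = generalize({"name": "Cerebrum"}, ["software_A", "software_B"])
```

The key design choice the sketch mirrors is decoupling: only the planner is optimized per domain, so its specialized knowledge can later be consolidated without retraining the executor.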
Evaluated on the ScienceBoard benchmark, which features diverse novel software, our framework significantly outperforms the baseline and establishes a new state of the art (SOTA) among open-source models, with strong generalization to novel software and unseen executors such as code agents.
All the code and models will be made publicly available to foster further research.
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3273