Keywords: reinforcement learning; world model; multi-agent
Abstract: Recent work on vision-language(-action) agents shows that vision-language models (VLMs) are strong at high-level reasoning but struggle to realize plans as reliable low-latency action sequences, while world-model controllers excel at fast observation-to-action control but lack open-ended task guidance. In this work, we combine these strengths by conditioning a learned world-model controller on language so that it can act autonomously at high frequency while guided by sparse, higher-latency textual instructions generated by VLMs. Our system, Speak-to-Act, includes an instructable controller that autoregressively generates high-frequency actions and can either follow language instructions from an instruction agent or operate autonomously in a high-throughput environment. To train controllers to be language-instructable, we relabel segments of controller policy rollouts with instructions and optimize a behavior-cloning objective. Our framework extends naturally to multi-agent settings, enabling communication between VLMs that use trained controllers as actuators, without relying on multi-agent reinforcement learning algorithms. We report results on various embodied environments and tasks, scaling trends with larger controllers and VLMs, and ablations on instruction cadence, planning frequency, and online vs. offline planning latency. The results show that with our decoupled architecture, Speak-to-Act can flexibly switch between different VLMs and scales well to multiple agents and longer chains of reasoning, achieving state-of-the-art performance on six tasks.
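The relabeling step described in the abstract can be illustrated with a minimal sketch. All names here (`Step`, `relabel_rollout`, the `describe` labeler) are hypothetical stand-ins, not the paper's actual implementation: segments of a controller rollout are paired with a hindsight instruction describing what the segment achieved, producing (instruction, observation, action) tuples for a behavior-cloning objective.

```python
# Hypothetical sketch of hindsight instruction relabeling, assuming
# fixed-length segments and a callable `describe` standing in for a
# VLM-based labeler. Names are illustrative, not from the paper.
from dataclasses import dataclass


@dataclass
class Step:
    obs: str      # placeholder observation
    action: str   # placeholder low-level action


def relabel_rollout(rollout, segment_len, describe):
    """Split a rollout into fixed-length segments and attach a
    hindsight instruction produced by `describe` to every step."""
    dataset = []
    for start in range(0, len(rollout), segment_len):
        segment = rollout[start:start + segment_len]
        instruction = describe(segment)
        for step in segment:
            dataset.append((instruction, step.obs, step.action))
    return dataset


# Toy usage: a 4-step rollout, segments of length 2, a trivial "labeler"
# that names the segment's final observation as the goal.
rollout = [Step(f"o{i}", f"a{i}") for i in range(4)]
data = relabel_rollout(rollout, 2, lambda seg: f"reach {seg[-1].obs}")
# Each tuple pairs the hindsight instruction with one (obs, action) step;
# behavior cloning then maximizes log p(action | instruction, obs).
```

Under this framing, the controller never needs instruction labels at collection time: any rollout, successful or not, yields valid supervised pairs once relabeled in hindsight.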
Primary Area: applications to robotics, autonomy, planning
Submission Number: 23516