Keywords: reinforcement learning; world model; multi-agent
Abstract: Recent work on vision-language(-action) agents shows that vision-language models (VLMs) are strong at high-level reasoning but struggle to realize plans as reliable low-latency action sequences, while world-model controllers excel at fast observation-to-action control but lack open-ended task guidance. In this work, we combine these strengths by conditioning a learned world-model controller on language so that it can act autonomously at high frequency while guided by sparse, higher-latency textual instructions generated by VLMs. Our system, Speak-to-Act, includes an instructable controller that autoregressively generates high-frequency actions and can either follow language instructions from an instruction agent or operate autonomously in a high-throughput environment. To train controllers to be language-instructable, we relabel segments of controller policy rollouts with instructions and optimize a behavior-cloning objective. Our framework extends naturally to multi-agent settings, enabling communication between VLMs that use trained controllers as actuators, without relying on multi-agent reinforcement learning algorithms. We report results on various embodied environments and tasks, scaling trends with larger controllers and VLMs, and ablations on instruction cadence, planning frequency, and online vs. offline planning latency. The results show that with our decoupled architecture, Speak-to-Act can flexibly switch between different VLMs and scales well to multiple agents and longer chains of reasoning, achieving state-of-the-art performance on six tasks.
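The relabeling step described in the abstract can be illustrated with a minimal sketch. All names here (`Step`, `relabel_rollout`, the `describe` labeler) are hypothetical stand-ins, not the paper's actual implementation: segments of a controller rollout are paired with a hindsight instruction describing what the segment achieved, producing (instruction, observation, action) tuples for a behavior-cloning objective.

```python
# Hypothetical sketch of hindsight instruction relabeling, assuming
# fixed-length segments and a callable `describe` standing in for a
# VLM-based labeler. Names are illustrative, not from the paper.
from dataclasses import dataclass


@dataclass
class Step:
    obs: str      # placeholder observation
    action: str   # placeholder low-level action


def relabel_rollout(rollout, segment_len, describe):
    """Split a rollout into fixed-length segments and attach a
    hindsight instruction produced by `describe` to every step."""
    dataset = []
    for start in range(0, len(rollout), segment_len):
        segment = rollout[start:start + segment_len]
        instruction = describe(segment)
        for step in segment:
            dataset.append((instruction, step.obs, step.action))
    return dataset


# Toy usage: a 4-step rollout, segments of length 2, a trivial "labeler"
# that names the segment's final observation as the goal.
rollout = [Step(f"o{i}", f"a{i}") for i in range(4)]
data = relabel_rollout(rollout, 2, lambda seg: f"reach {seg[-1].obs}")
# Each tuple pairs the hindsight instruction with one (obs, action) step;
# behavior cloning then maximizes log p(action | instruction, obs).
```

Under this framing, the controller never needs instruction labels at collection time: any rollout, successful or not, yields valid supervised pairs once relabeled in hindsight.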
Primary Area: applications to robotics, autonomy, planning
Submission Number: 23516