Endowing GPT-4 with a Humanoid Body: Building the Bridge Between Off-the-Shelf VLMs and the Physical World

ICLR 2026 Conference Submission 4884 Authors

13 Sept 2025 (modified: 29 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: human-scene interaction, VLM agent, motion generation
Abstract: In this paper, we explore how to empower general-purpose Vision-Language Models (VLMs) to control humanoid agents. General-purpose VLMs (e.g., GPT-4) exhibit strong open-world generalization and remove the need for additional fine-tuning data. Building such an agent requires two key components: (1) an embodied instruction compiler, which enables the VLM to observe the scene and translate high-level user instructions into low-level control parameters; and (2) a motion executor, which generates human motions from these parameters while adapting to real-time physical feedback. We present BiBo, a VLM-driven humanoid agent composed of an embodied instruction compiler and a diffusion-based motion executor. The compiler interprets user instructions in the context of the environment and leverages a chain of visual question answering (VQA) queries to guide the VLM in specifying control parameters (e.g., motion captions, locations). The diffusion executor extends future joint trajectories from prior motion, conditioned on both the control parameters and environmental feedback. Experiments demonstrate that BiBo achieves an interaction task success rate of 90.2% in open environments and improves the precision of text-guided motion execution by 16.3% over prior methods. BiBo handles not only basic interactions but also diverse motions, such as dancing while striking a sandbag. The code will be released upon publication.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 4884
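
The abstract describes a two-stage pipeline: an embodied instruction compiler (a VLM driven by a chain of VQA queries that produces control parameters) feeding a diffusion-based motion executor that autoregressively extends joint trajectories under environmental feedback. The following is a minimal, hypothetical sketch of that control flow; all class, function, and parameter names (ControlParams, compile_instruction, execute_motion, the stub VLM, the stub diffusion sampler, and the prompts) are assumptions for illustration and are not the paper's actual interfaces.

```python
"""Hypothetical sketch of a BiBo-style compile-then-execute loop.

Stage 1: a chain of VQA queries to a general-purpose VLM turns a high-level
instruction plus a scene observation into low-level control parameters.
Stage 2: a diffusion sampler extends the joint trajectory from prior motion,
conditioned on those parameters and on real-time physical feedback.
"""
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ControlParams:
    """Low-level control parameters produced by the compiler (assumed schema)."""
    motion_caption: str                              # e.g. "walk to the chair and sit"
    target_location: Tuple[float, float, float]      # goal position in the scene


def compile_instruction(
    instruction: str,
    scene_image: bytes,
    vlm_query: Callable[[bytes, str], str],
) -> ControlParams:
    """Chain of VQA: each query narrows the instruction into concrete parameters
    (target object -> target location -> motion caption). Prompts are placeholders."""
    target = vlm_query(scene_image, f"Which object does '{instruction}' involve?")
    loc_answer = vlm_query(scene_image, f"Give the x,y,z position of the {target}.")
    caption = vlm_query(scene_image, f"Describe the body motion needed to {instruction}.")
    x, y, z = (float(v) for v in loc_answer.split(","))
    return ControlParams(motion_caption=caption, target_location=(x, y, z))


def execute_motion(
    params: ControlParams,
    prior_motion: List[List[float]],
    sample_segment: Callable[..., List[List[float]]],
    get_feedback: Callable[[], Dict],
    task_done: Callable[[List[List[float]]], bool],
    history_len: int = 16,
) -> List[List[float]]:
    """Autoregressively extend joint trajectories with a diffusion sampler,
    conditioned on the control parameters and environmental feedback."""
    motion = list(prior_motion)
    while not task_done(motion):
        segment = sample_segment(
            history=motion[-history_len:],           # prior joint trajectory
            caption=params.motion_caption,           # text condition
            goal=params.target_location,             # spatial condition
            feedback=get_feedback(),                 # physical feedback condition
        )
        motion.extend(segment)
    return motion


if __name__ == "__main__":
    # Stub VLM and diffusion sampler so the sketch runs end to end.
    def fake_vlm(image: bytes, prompt: str) -> str:
        return "1.0, 0.0, 0.5" if "position" in prompt else "walk forward and sit"

    def fake_sampler(history, caption, goal, feedback):
        return [[0.0, 0.0, 0.0]]                     # one dummy trajectory frame

    params = compile_instruction("sit on the chair", b"", fake_vlm)
    motion = execute_motion(
        params,
        prior_motion=[[0.0, 0.0, 0.0]],
        sample_segment=fake_sampler,
        get_feedback=lambda: {"contact": False},
        task_done=lambda m: len(m) >= 5,
    )
    print(params, len(motion))
```

In this sketch the stubs stand in for the VLM and the diffusion model; in the system described by the abstract, those calls would be replaced by the actual GPT-4 VQA chain and the trained diffusion executor, with feedback coming from the physics simulation.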