Endowing GPT-4 with a Humanoid Body: Building the Bridge Between Off-the-Shelf VLMs and the Physical World

ICLR 2026 Conference Submission 4884 Authors

13 Sept 2025 (modified: 29 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: human-scene interaction, VLM agent, motion generation
Abstract: In this paper, we explore how to empower general-purpose Vision-Language Models (VLMs) to control humanoid agents. General-purpose VLMs (e.g., GPT-4) exhibit strong open-world generalization and remove the need for additional fine-tuning data. Building such an agent requires two key components: (1) an embodied instruction compiler, which enables the VLM to observe the scene and translate high-level user instructions into low-level control parameters; and (2) a motion executor, which generates human motions from these parameters while adapting to real-time physical feedback. We present BiBo, a VLM-driven humanoid agent composed of an embodied instruction compiler and a diffusion-based motion executor. The compiler interprets user instructions in the context of the environment and leverages a chain of visual question answering (VQA) queries to guide the VLM in specifying control parameters (e.g., motion captions, locations). The diffusion executor extends future joint trajectories from prior motion, conditioned on both the control parameters and environmental feedback. Experiments demonstrate that BiBo achieves an interaction task success rate of 90.2% in open environments and improves the precision of text-guided motion execution by 16.3% over prior methods. BiBo handles not only basic interactions but also diverse motions, such as dancing while striking a sandbag. The code will be released upon publication.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 4884
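
The abstract describes a two-stage pipeline: an embodied instruction compiler (a VLM driven by a chain of VQA queries that produces control parameters) feeding a diffusion-based motion executor that autoregressively extends joint trajectories under environmental feedback. The following is a minimal, hypothetical sketch of that control flow; all class, function, and parameter names (ControlParams, compile_instruction, execute_motion, the stub VLM, the stub diffusion sampler, and the prompts) are assumptions for illustration and are not the paper's actual interfaces.

```python
"""Hypothetical sketch of a BiBo-style compile-then-execute loop.

Stage 1: a chain of VQA queries to a general-purpose VLM turns a high-level
instruction plus a scene observation into low-level control parameters.
Stage 2: a diffusion sampler extends the joint trajectory from prior motion,
conditioned on those parameters and on real-time physical feedback.
"""
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ControlParams:
    """Low-level control parameters produced by the compiler (assumed schema)."""
    motion_caption: str                              # e.g. "walk to the chair and sit"
    target_location: Tuple[float, float, float]      # goal position in the scene


def compile_instruction(
    instruction: str,
    scene_image: bytes,
    vlm_query: Callable[[bytes, str], str],
) -> ControlParams:
    """Chain of VQA: each query narrows the instruction into concrete parameters
    (target object -> target location -> motion caption). Prompts are placeholders."""
    target = vlm_query(scene_image, f"Which object does '{instruction}' involve?")
    loc_answer = vlm_query(scene_image, f"Give the x,y,z position of the {target}.")
    caption = vlm_query(scene_image, f"Describe the body motion needed to {instruction}.")
    x, y, z = (float(v) for v in loc_answer.split(","))
    return ControlParams(motion_caption=caption, target_location=(x, y, z))


def execute_motion(
    params: ControlParams,
    prior_motion: List[List[float]],
    sample_segment: Callable[..., List[List[float]]],
    get_feedback: Callable[[], Dict],
    task_done: Callable[[List[List[float]]], bool],
    history_len: int = 16,
) -> List[List[float]]:
    """Autoregressively extend joint trajectories with a diffusion sampler,
    conditioned on the control parameters and environmental feedback."""
    motion = list(prior_motion)
    while not task_done(motion):
        segment = sample_segment(
            history=motion[-history_len:],           # prior joint trajectory
            caption=params.motion_caption,           # text condition
            goal=params.target_location,             # spatial condition
            feedback=get_feedback(),                 # physical feedback condition
        )
        motion.extend(segment)
    return motion


if __name__ == "__main__":
    # Stub VLM and diffusion sampler so the sketch runs end to end.
    def fake_vlm(image: bytes, prompt: str) -> str:
        return "1.0, 0.0, 0.5" if "position" in prompt else "walk forward and sit"

    def fake_sampler(history, caption, goal, feedback):
        return [[0.0, 0.0, 0.0]]                     # one dummy trajectory frame

    params = compile_instruction("sit on the chair", b"", fake_vlm)
    motion = execute_motion(
        params,
        prior_motion=[[0.0, 0.0, 0.0]],
        sample_segment=fake_sampler,
        get_feedback=lambda: {"contact": False},
        task_done=lambda m: len(m) >= 5,
    )
    print(params, len(motion))
```

In this sketch the stubs stand in for the VLM and the diffusion model; in the system described by the abstract, those calls would be replaced by the actual GPT-4 VQA chain and the trained diffusion executor, with feedback coming from the physics simulation.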