HumanoidVerse: A Versatile Humanoid for Vision-Language Guided Multi-Object Rearrangement

Haozhuo Zhang; Jingkai SUN; Michele Caprio; Jian Tang; Shanghang Zhang; Qiang Zhang; Wei Pan

16 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Humanoid, vision-language, object-rearrangement, robot, long-horizon

Abstract: We introduce HumanoidVerse, a novel framework for vision-language guided humanoid control that enables a single physically simulated robot to perform long-horizon, multi-object rearrangement tasks across diverse scenes. Unlike prior methods that operate in fixed settings with single-object interactions, our approach supports consecutive manipulation of multiple objects, guided only by natural language instructions and egocentric RGB camera observations. HumanoidVerse is trained via a multi-stage curriculum using a dual-teacher distillation pipeline, enabling fluid transitions between sub-tasks without requiring environment resets. To support this training, we construct a large-scale dataset comprising 350 multi-object tasks spanning four room layouts. Extensive experiments in the Isaac Gym simulator demonstrate that our method significantly outperforms the prior state of the art in both task success rate and spatial precision, and generalizes well to unseen environments and instructions. Our work represents a key step toward robust, general-purpose humanoid agents capable of executing complex, sequential tasks under real-world sensory constraints.

Supplementary Material: zip

Primary Area: applications to robotics, autonomy, planning

Submission Number: 6918
