HumanVLA: Towards Vision-Language Directed Object Rearrangement by Physical Humanoid

Published: 25 Sept 2024 · Last Modified: 06 Nov 2024 · NeurIPS 2024 poster · License: CC BY-NC 4.0
Keywords: Human-Scene Interaction; Object Rearrangement; Vision-Language-Action Model; Physical Humanoid
TL;DR: We introduce HumanVLA, which performs a variety of object rearrangement tasks directed by vision and language with a physical humanoid.
Abstract: Physical Human-Scene Interaction (HSI) plays a crucial role in numerous applications. However, existing HSI techniques are confined to specific object dynamics and rely on privileged information, which prevents the development of more comprehensive applications. To address this limitation, we introduce HumanVLA for general object rearrangement directed by practical vision and language. HumanVLA is developed through a teacher-student framework: a state-based teacher policy is first trained using goal-conditioned reinforcement learning and an adversarial motion prior, and is then distilled into a vision-language-action model via behavior cloning. We propose several key insights to facilitate the large-scale learning process. To support general object rearrangement by a physical humanoid, we introduce a novel Human-in-the-Room dataset encompassing various rearrangement tasks. Through extensive experiments and analysis, we demonstrate the effectiveness of our approach.
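The sketch below illustrates the teacher-student distillation described in the abstract: a privileged state-based teacher supervises a vision-language-action student through behavior-cloning regression. This is a minimal illustration, not the authors' implementation; all module names, feature dimensions, and the MSE objective are assumptions for exposition.

```python
# Minimal sketch (not the authors' code) of distilling a privileged
# state-based teacher into a vision-language-action student via
# behavior cloning. Shapes and architectures are illustrative only.
import torch
import torch.nn as nn

class TeacherPolicy(nn.Module):
    """State-based teacher: maps privileged state to humanoid actions."""
    def __init__(self, state_dim=128, action_dim=28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 512), nn.ReLU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, state):
        return self.net(state)

class StudentVLA(nn.Module):
    """Student: fuses visual and language features into actions."""
    def __init__(self, vis_dim=256, lang_dim=256, action_dim=28):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + lang_dim, 512), nn.ReLU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, vis_feat, lang_feat):
        return self.fuse(torch.cat([vis_feat, lang_feat], dim=-1))

def distill_step(teacher, student, optimizer, batch):
    """One behavior-cloning step: regress student onto teacher actions."""
    with torch.no_grad():
        target = teacher(batch["state"])        # privileged supervision
    pred = student(batch["vis_feat"], batch["lang_feat"])
    loss = nn.functional.mse_loss(pred, target)  # assumed BC loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this framing, the teacher sees ground-truth simulator state that is unavailable at deployment, while the student learns to reproduce its actions from practical vision and language inputs alone.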
Supplementary Material: zip
Primary Area: Reinforcement learning
Submission Number: 299