Visually-grounded Humanoid Agents

Hang Ye; Fan Lu; Xiaoxuan Ma; Wayne Wu; Kwan-Yee Lin; Yizhou Wang

09 Sept 2025 (modified: 12 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0

Keywords: Humanoid Agents, Embodied AI, 3D Human, Motion Generation, 3D Scene Reconstruction

Abstract: Digital human generation has been studied for decades and supports a wide range of real-world applications. However, most existing systems are **passively** animated, relying on privileged state or scripted control, which limits scalability to novel, unseen environments. We instead ask: how can digital humans **actively** behave using only *visual observations* and *specified goals* in novel scenes? Achieving this would enable populating any 3D environment, at scale, with digital humans that exhibit spontaneous, natural, goal-directed behaviors. To this end, we introduce **Visually-grounded Humanoid Agents**, a coupled two-layer (world-agent) paradigm that replicates humans at multiple levels: they *look, perceive, reason, and behave* like real people in real-world 3D scenes. The World Layer provides a structured substrate for interaction by reconstructing semantically rich 3D Gaussian scenes from real-world videos via an occlusion-aware semantic scene reconstruction pipeline and accommodating animatable Gaussian-based human avatars within them. The Agent Layer transforms these avatars into autonomous humanoid agents, equipping them with first-person RGB-D perception and enabling them to perform accurate, embodied planning with spatial awareness and iterative reasoning. The resulting plans are then executed at the low level as full-body actions that drive the agents' behavior in the scene. We further introduce a benchmark to evaluate humanoid–scene interaction within reconstructed 3D environments. Experimental results demonstrate that our agents achieve robust autonomous behavior through effective planning and action execution, yielding higher task success rates and fewer collisions than both ablated variants and state-of-the-art planning methods. This work offers a new perspective on populating scenes with digital humans in an active manner, opening further research opportunities for the community and fostering human-centric embodied AI.

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 3271
