HELIOS: Hierarchical Exploration for Language-grounded Interaction in Open Scenes

04 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · Readers: Everyone · License: CC BY 4.0
Keywords: active perception, mobile manipulation, 3D Gaussian splatting
Abstract: Language-specified mobile manipulation tasks in novel environments face three simultaneous challenges: interacting with a scene that is only partially observed, grounding semantic information from language instructions in that partially observed scene, and actively updating knowledge of the scene as new observations arrive. To address these challenges, we propose HELIOS, a hierarchical scene representation and associated search objective for performing language-specified pick-and-place mobile manipulation tasks. We construct 2D maps containing the semantic and occupancy information relevant for navigation while simultaneously and actively constructing 3D Gaussian representations of task-relevant objects. We fuse observations across this multi-layered representation while explicitly modeling the multi-view consistency of the detections of each object. To search efficiently for the target object, we formulate an objective function that balances exploration of unobserved or uncertain regions with exploitation of scene semantic information. We evaluate HELIOS on the OVMM benchmark in the Habitat simulator, a pick-and-place benchmark in which perception is challenging because the scenes are large and complex and the target objects are comparatively small. HELIOS achieves state-of-the-art results on OVMM. Because our approach is zero-shot, HELIOS can also transfer to the real world without additional data, as we illustrate by demonstrating it in a real-world office environment on a Spot robot. Our method leverages pretrained VLMs to achieve these results in simulation and in the real world without any task-specific training.
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 2188