Common Sense Language-Guided Exploration and Hierarchical Dense Perception for Instruction Following Embodied Agents

Published: 01 Jan 2024, Last Modified: 16 May 2025. Venue: ICME 2024. License: CC BY-SA 4.0
Abstract: Embodied Instruction Following (EIF) requires an agent to locate and manipulate objects according to language instructions. Existing methods struggle with small object navigation because of inefficient exploration and imperfect perception, which limits their overall performance. This study focuses on small object navigation in the EIF domain. We propose Common Sense Language-guided exploration (CSL), a novel approach that combines common-sense knowledge from seen scenes with information from the language instructions to infer the location of objects, significantly improving exploration efficiency. We also propose Hierarchical Dense Perception (HDP), which uses hierarchical features to perform semantic segmentation and depth estimation, significantly improving the agent’s perceptual capabilities. Experiments on the ALFRED benchmark demonstrate the effectiveness of the combined method, CSL-HDP, which achieves an absolute improvement of 9.29% (18.45% relative) on unseen test scenes over the previous state of the art, securing the top position on the leaderboard. Code will be available at https://github.com/Cyuanwen/CSL-HDP.
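The absolute and relative improvements quoted in the abstract are linked by simple arithmetic: the relative gain is the absolute gain divided by the baseline score. The sketch below, a reader's illustration and not part of the paper, uses that relationship to back out the approximate baseline the two figures imply. The function name `implied_baseline` is hypothetical, the abstract does not say which metric the figures refer to, and both inputs are rounded, so the result is only approximate.

```python
def implied_baseline(absolute_gain: float, relative_gain: float) -> float:
    """Return the baseline score implied by an absolute and a relative gain.

    Uses relative_gain = absolute_gain / baseline, rearranged for baseline.
    absolute_gain is in percentage points; relative_gain is a fraction.
    """
    if relative_gain <= 0:
        raise ValueError("relative_gain must be positive")
    return absolute_gain / relative_gain


# Figures quoted in the abstract (rounded as reported there).
ABSOLUTE_GAIN = 9.29      # percentage points
RELATIVE_GAIN = 0.1845    # 18.45% expressed as a fraction

baseline = implied_baseline(ABSOLUTE_GAIN, RELATIVE_GAIN)
improved = baseline + ABSOLUTE_GAIN

print(f"implied previous best: about {baseline:.1f}%")
print(f"implied CSL-HDP score: about {improved:.1f}%")
```

Under these assumptions the previous best sits at roughly 50% and CSL-HDP at roughly 60% on the unnamed metric; the exact values should be taken from the paper or the leaderboard, not from this back-calculation.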