Embodied Contrastive Learning with Geometric Consistency and Behavioral Awareness for Object Navigation

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM2024 Poster, Readers: Everyone, CC BY 4.0
Abstract: Object Navigation (ObjectNav), which enables an agent to seek any instance of an object category specified by a semantic label, has shown great advances. However, current agents are built upon occlusion-prone visual observations or compressed 2D semantic maps, which hinder their embodied perception of 3D scene geometry and easily lead to ambiguous object localization and blind exploration. To address these limitations, we present an Embodied Contrastive Learning (ECL) method with Geometric Consistency (GC) and Behavioral Awareness (BA), which motivates agents to actively encode 3D scene layouts and semantic cues. Driven by our embodied exploration strategy, BA is modeled by predicting navigational actions based on multi-frame visual images, as behaviors that cause differences between adjacent visual sensations are crucial for learning correlations among continuous visions. The GC is modeled as the alignment of behavior-aware visual stimulus with 3D semantic shapes by employing unsupervised contrastive learning. The aligned behavior-aware visual features and geometric invariance priors are injected into a modular ObjectNav framework to enhance object recognition and exploration capabilities. As expected, our ECL method performs well on object detection and instance segmentation tasks. Our ObjectNav strategy outperforms state-of-the-art methods on MP3D and Gibson datasets, showing the potential of our ECL in embodied navigation. The experimental code is available as supplementary material.
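Illustrative sketch (not the authors' implementation): the abstract describes GC as aligning behavior-aware visual features with 3D semantic shapes through unsupervised contrastive learning, but it does not state the loss. One common choice for this kind of 2D-3D alignment is a symmetric InfoNCE objective over paired embeddings, shown below in NumPy. The function name `info_nce`, the temperature value, and the feature dimensions are assumptions made for illustration only.

```python
import numpy as np


def info_nce(img_feats, shape_feats, temperature=0.07):
    """Symmetric InfoNCE loss over N paired (image, 3D shape) embeddings.

    Hypothetical helper: row i of img_feats and row i of shape_feats are
    assumed to describe the same scene content (the positive pair); all
    other rows in the batch act as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    shp = shape_feats / np.linalg.norm(shape_feats, axis=1, keepdims=True)

    # (N, N) similarity matrix; positives lie on the diagonal.
    logits = img @ shp.T / temperature

    def diag_cross_entropy(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-shape and shape-to-image directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))


rng = np.random.default_rng(0)
img_feats = rng.normal(size=(8, 16))                        # stand-in 2D features
aligned = img_feats + 0.01 * rng.normal(size=(8, 16))       # matching 3D features
mismatched = np.roll(img_feats, 1, axis=0)                  # pairs shifted by one

loss_aligned = info_nce(img_feats, aligned)
loss_mismatched = info_nce(img_feats, mismatched)
print(loss_aligned < loss_mismatched)
```

Under this assumed objective, correctly paired 2D and 3D features yield a lower loss than mismatched pairs, which is the sense in which the visual encoder is pushed toward geometry-consistent representations.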
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: This work investigates object recognition and object goal navigation tasks. This paper contributes to multimedia/multimodality in four aspects: (1) We first propose a 2D-3D cross-modal Embodied Contrastive Representation Learning (ECRL) method with Geometric Consistency (GC) and Behavioral Awareness (BA). The 2D scene understanding is enhanced by introducing geometric and view-invariant priors into the behavior-aware 2D visual features, which improves the accuracy of object recognition. (2) We propose a modular object goal navigation policy based on RGB visual images, 2D semantic maps, and 3D point cloud features. Benefiting from 2D-3D ECRL pre-training and the multimodal information processing of the navigation strategy, our approach achieves state-of-the-art navigation performance. (3) A curiosity-driven and action-aware exploration strategy is proposed to support the ECRL by collecting diverse action-vision pairs online. The collected action-modal and visual-modal features provide a rich basis for the BA and GC modeling. (4) Extensive comparative and ablation studies on object detection, instance segmentation, and object goal navigation tasks demonstrate the superiority of our multimodal information processing and fusion techniques.
Supplementary Material: zip
Submission Number: 416