Visual Navigation by Fusing Object Semantic Feature

Published: 01 Jan 2024, Last Modified: 12 Apr 2025, SMC 2024, CC BY-SA 4.0
Abstract: The key to object-goal visual navigation is learning the spatial relationships among environmental objects and assessing their semantic correlations with the target object. We propose an end-to-end visual navigation model based on deep reinforcement learning, called G2SNet, which consists of two feature maps and a dedicated fusion network: the GloVe Feature Map (GFM), the Sbbox Feature Map (SFM), and the GloVe fusion Network (GNet). GFM encodes the positions and semantic information of the objects in the observation image, mitigating the interference that complex background information causes in target recognition. SFM records the sizes of objects in the field of view to support distance estimation. GNet relies entirely on network learning to compute the semantic correlations and spatial positional relationships among objects in the environment, which lets the agent learn the spatial relationships between objects in GFM and gives it better generalization capability. Experiments on AI2-THOR demonstrate the effectiveness of the three proposed structures; the average SPL across the four known scene types increases by 16.6%.
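The abstract does not give implementation details, so the following is only a minimal PyTorch sketch of how the two feature maps and the fusion network could fit together. The grid resolution, embedding and channel sizes, the detection input format, and all layer choices (the 1x1 semantic reduction, the 3x3 fusion convolutions, broadcasting the target embedding over the grid) are assumptions made for illustration, not the authors' architecture.

```python
# Hypothetical sketch of GFM/SFM construction and GNet fusion.
# Grid size, dimensions, and layers are illustrative assumptions.
import torch
import torch.nn as nn

GRID = 7          # assumed spatial grid over the observation image
GLOVE_DIM = 300   # standard GloVe embedding size

def build_feature_maps(detections, glove):
    """detections: list of (label, cx, cy, w, h), coords normalized to [0, 1].
    glove: dict mapping object labels to GloVe vectors.
    Returns GFM (GLOVE_DIM x GRID x GRID) and SFM (1 x GRID x GRID)."""
    gfm = torch.zeros(GLOVE_DIM, GRID, GRID)
    sfm = torch.zeros(1, GRID, GRID)
    for label, cx, cy, w, h in detections:
        i = min(int(cy * GRID), GRID - 1)
        j = min(int(cx * GRID), GRID - 1)
        gfm[:, i, j] = glove[label]   # object semantics placed at its location
        sfm[0, i, j] = w * h          # normalized box area as a distance cue
    return gfm, sfm

class GNet(nn.Module):
    """Learned fusion of GFM, SFM, and the target word embedding."""
    def __init__(self, hidden=64):
        super().__init__()
        self.reduce = nn.Conv2d(GLOVE_DIM, hidden, 1)   # compress semantics
        self.target_proj = nn.Linear(GLOVE_DIM, hidden)
        self.fuse = nn.Sequential(
            nn.Conv2d(hidden + 1 + hidden, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.ReLU(),
        )

    def forward(self, gfm, sfm, target_glove):
        g = self.reduce(gfm.unsqueeze(0))                   # 1 x hidden x G x G
        t = self.target_proj(target_glove)                  # hidden
        t = t.view(1, -1, 1, 1).expand(-1, -1, GRID, GRID)  # broadcast target
        x = torch.cat([g, sfm.unsqueeze(0), t], dim=1)
        return self.fuse(x).flatten(1)                      # input to RL policy
```

In this sketch the target's word embedding is broadcast over the grid so the fusion convolutions can weigh each detected object's GloVe vector and size against the goal, which mirrors the abstract's description of learning semantic correlations and spatial relationships jointly rather than through hand-crafted rules.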