Abstract: Object goal navigation requires an embodied agent to find the location of a given target using only visual observations. The mapping from visual perception to navigation actions determines the agent's behavior. We attribute the poor cross-scene generalization of existing agents to a lack of strong visual perception and spatial reasoning ability. The mutual relationships between objects and the edges connecting them are the essential part of a scene graph, and they reflect a deep understanding of the observed scene. Despite recent advances such as visual transformers and contextual information embedding, learning graph representations for visual perception remains challenging. In this work, we propose a novel Heterogeneous Zone Graph Visual Transformer formulation for graph representation and visual perception. It consists of two key ideas: 1) a Heterogeneous Zone Graph (HZG) that encodes heterogeneous target-related zones and their spatial information, allowing the agent to navigate efficiently; 2) a Relation-wise Transformer Network (RTN) that maps the relationships among previously observed objects to navigation actions. RTN extracts rich node and edge features while paying more attention to the target-related zones. We apply self-attention in the node-to-node encoder and cross-attention in the edge-to-node decoder. The HZG-based model with RTN is shown to improve the agent's policy and to achieve state-of-the-art results on commonly used datasets.
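The encoder/decoder attention scheme described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the class name, layer sizes, head counts, and the residual/LayerNorm wiring are all assumptions; it shows only the abstract's stated pattern of self-attention over node features (node-to-node encoder) followed by cross-attention from edge features to encoded nodes (edge-to-node decoder).

```python
import torch
import torch.nn as nn

class RelationWiseTransformerSketch(nn.Module):
    """Illustrative sketch of the RTN attention pattern from the abstract.

    Encoder: self-attention among graph node features (node-to-node).
    Decoder: cross-attention where edge features query the encoded
    nodes (edge-to-node). All dimensions are hypothetical.
    """

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.node_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.edge_node_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, node_feats: torch.Tensor, edge_feats: torch.Tensor) -> torch.Tensor:
        # Node-to-node encoder: every node attends to every other node.
        enc, _ = self.node_self_attn(node_feats, node_feats, node_feats)
        enc = self.norm1(node_feats + enc)
        # Edge-to-node decoder: edge features act as queries over encoded nodes.
        dec, _ = self.edge_node_cross_attn(edge_feats, enc, enc)
        return self.norm2(edge_feats + dec)

# Toy usage: 2 scenes, 5 nodes and 8 edges each, 64-dim features (hypothetical shapes).
nodes = torch.randn(2, 5, 64)
edges = torch.randn(2, 8, 64)
out = RelationWiseTransformerSketch()(nodes, edges)
print(tuple(out.shape))  # (2, 8, 64): one refined feature per edge
```

The output keeps one refined feature per edge, which a downstream policy head could consume when scoring navigation actions.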