Abstract: Reinforcement-learning-based visual navigation methods often rely on auxiliary information such as depth images, semantic segmentation, object detection, and relational graphs. By comparison, methods that use only RGB images require no additional equipment and are more flexible. However, these methods often underutilize the information in RGB images, which leads to poor generalization. To address this limitation, we present the Target-Driven Memory-Augmented (TDMA) framework. The framework uses an external memory to store fused Target-Scene features obtained from the observed and target images. To capture and leverage long-term dependencies in the stored data, we employ a Transformer to process this historical information. Additionally, we introduce a self-attention sub-layer in the Transformer's Decoder to strengthen the model's focus on similar regions between the observed and target images. Experiments on the AI2-THOR dataset show that the proposed method achieves an 8% improvement in success rate and a 16% improvement in success weighted by path length (SPL) over methods evaluated under the same experimental setup.
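The two reported metrics, success rate and success weighted by path length (commonly abbreviated SPL), can be illustrated with a minimal sketch. The abstract does not define them, so the code below assumes the definition of SPL commonly used in embodied navigation: each successful episode contributes the ratio of the shortest path length to the longer of the agent's path and the shortest path, and the sum is averaged over all episodes. The function and parameter names are illustrative and are not taken from the paper.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent reached the target."""
    if not successes:
        raise ValueError("at least one episode is required")
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    agent_lengths: Sequence[float],
) -> float:
    """Success weighted by path length, under the commonly used definition.

    Assumption: shortest_lengths[i] is the (positive) shortest path length
    from start to target in episode i, and agent_lengths[i] is the length of
    the path the agent actually took.
    """
    if not (len(successes) == len(shortest_lengths) == len(agent_lengths)):
        raise ValueError("all inputs must have the same number of episodes")
    if not successes:
        raise ValueError("at least one episode is required")

    total = 0.0
    for succeeded, shortest, taken in zip(successes, shortest_lengths, agent_lengths):
        if shortest <= 0:
            raise ValueError("shortest path lengths must be positive")
        if succeeded:
            # A failed episode contributes 0; a successful one is discounted
            # by how much longer the agent's path was than the optimal one.
            total += shortest / max(taken, shortest)
    return total / len(successes)
```

Under this definition an agent that always succeeds along the shortest path scores 1.0, while detours lower the score even when the target is reached, which is why SPL is usually reported alongside the plain success rate.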