CResT: Cross-Query Residual Transformer for Object Goal Navigation

15 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: applications to robotics, autonomy, planning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Object Goal Navigation, Vision Transformer, Visual Encoding, Reinforcement Learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose the Cross-Query Residual Transformer (CResT), which extracts richer visual features from RGB camera images to improve navigation results. CResT comprises three components: the overall network, the Residual Transformer, and the Random Mask.
Abstract: Object Goal Navigation (OGN) is the task of navigating from a random location to target objects in an unknown environment. End-to-end navigation methods choose the agent's actions directly and therefore rely heavily on the state representation produced by the visual information processing network. In this paper, we propose the Cross-Query Residual Transformer (CResT), which extracts richer visual features from RGB camera images to improve navigation results. CResT comprises three components: the overall network, the Residual Transformer, and the Random Mask. In the overall network, the Global Feature and the Local Feature query each other and are then fused for better visual information processing. The Residual Transformer adds residual connections to the Transformer to address the vanishing gradient problem, which enables the whole network to be trained in one stage without pretraining and allows the Transformer to be several times deeper. The Random Mask is proposed for data augmentation and to reduce overfitting. Experiments demonstrate that CResT surpasses competing methods and achieves state-of-the-art performance on the AI2-THOR dataset. Ablation experiments show that the Residual Transformer and the Random Mask both contribute substantially to the navigation results.
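The three ideas named in the abstract (mutual querying of global and local features, residual connections around attention, and random masking) can be illustrated with a minimal sketch. The abstract does not give the architecture details, so everything below is an assumption made for illustration: the function names are hypothetical, the attention has no learned projections, features are plain lists of token vectors, masking zeroes whole tokens, and fusion is simple concatenation. It is not the authors' implementation.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention without learned projections (assumption).

    Each query token attends over keys_values, which serve as both keys and values.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(d)])
    return out


def residual_cross_attention(queries, keys_values):
    """Residual connection: output = queries + attention(queries, keys_values)."""
    attended = cross_attention(queries, keys_values)
    return [[x + a for x, a in zip(q, att)] for q, att in zip(queries, attended)]


def random_mask(tokens, drop_prob, rng):
    """Zero out each token independently with probability drop_prob (assumed scheme)."""
    return [[0.0] * len(t) if rng.random() < drop_prob else list(t) for t in tokens]


def cross_query_fuse(global_feat, local_feat):
    """Global tokens query local tokens and vice versa; fuse by concatenation (assumption)."""
    g = residual_cross_attention(global_feat, local_feat)
    l = residual_cross_attention(local_feat, global_feat)
    return g + l


if __name__ == "__main__":
    rng = random.Random(0)
    global_feat = [[1.0, 2.0]]                          # one hypothetical global token
    local_feat = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three hypothetical local tokens
    masked_local = random_mask(local_feat, 0.3, rng)    # augmentation at training time
    fused = cross_query_fuse(global_feat, masked_local)
    print(len(fused), "fused tokens of dimension", len(fused[0]))
```

The residual path means that each token's input is added back to its attention output, so the input signal (and its gradient) has a direct route through the block, which is the usual motivation for stacking such blocks more deeply.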
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 377