NavFormer: A Transformer Architecture for Robot Target-Driven Navigation in Unknown and Dynamic Environments
Abstract: In unknown, cluttered, and dynamic environments
such as disaster scenes, mobile robots need to perform target-driven
navigation in order to find people or objects of interest,
where the only information provided about these targets is
an image of each individual target. In this paper, we introduce
NavFormer, a novel end-to-end transformer architecture
developed for robot target-driven navigation in unknown and
dynamic environments. NavFormer leverages the strengths of
both 1) transformers for sequential data processing and 2) self-supervised
learning (SSL) for visual representation to reason
about spatial layouts and to perform collision avoidance in
dynamic settings. The architecture uniquely combines dual visual
encoders consisting of 1) a static encoder for extracting invariant
environment features for spatial reasoning, and 2) a general encoder
for dynamic obstacle avoidance. The primary robot navigation
task is decomposed into two sub-tasks for training: single robot
exploration and multi-robot collision avoidance. We perform
cross-task training to enable the transfer of learned skills to the
complex primary navigation task. Simulated experiments
demonstrate that NavFormer can effectively navigate a mobile
robot in diverse unknown environments, outperforming existing
state-of-the-art methods. A comprehensive ablation study is
performed to evaluate the impact of the main design choices of
NavFormer. Furthermore, real-world experiments validate the
generalizability of NavFormer.
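The dual-encoder design described above can be illustrated with a minimal sketch. This is not the paper's implementation: the encoder internals, feature dimensions, and the linear policy head are all placeholder assumptions standing in for the learned SSL backbones and transformer policy; it only shows how features from a static encoder, a general encoder, and a target image might be fused before an action is selected.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image, weights):
    # Flatten the image and project it to a feature vector (a toy
    # stand-in for a learned visual encoder backbone).
    return np.tanh(image.reshape(-1) @ weights)

def navformer_step(obs_image, target_image, w_static, w_general, w_policy):
    # Static encoder: invariant environment features for spatial reasoning.
    static_feat = encode(obs_image, w_static)
    # General encoder: features supporting dynamic obstacle avoidance.
    general_feat = encode(obs_image, w_general)
    # Target image embedded with the static encoder (an assumption of
    # this sketch; the real model may use a separate pathway).
    target_feat = encode(target_image, w_static)
    # Fuse the features; the actual architecture would feed these, plus
    # a history of past observations, through a transformer.
    fused = np.concatenate([static_feat, general_feat, target_feat])
    # Placeholder linear policy head producing action logits.
    return fused @ w_policy

D_IMG, D_FEAT, N_ACTIONS = 16 * 16, 32, 4  # hypothetical sizes
w_static = rng.normal(size=(D_IMG, D_FEAT)) * 0.05
w_general = rng.normal(size=(D_IMG, D_FEAT)) * 0.05
w_policy = rng.normal(size=(3 * D_FEAT, N_ACTIONS)) * 0.05

obs = rng.normal(size=(16, 16))     # current camera observation
goal = rng.normal(size=(16, 16))    # image of the navigation target
logits = navformer_step(obs, goal, w_static, w_general, w_policy)
action = int(np.argmax(logits))
```

The key design point reflected here is that the same observation passes through two separately trained encoders, so spatial-layout features and obstacle-avoidance features are kept distinct until fusion.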