Enhancing Visual Aligning and Grounding for Aerial Vision-and-Dialog Navigation

Guanhui Qiao, Dong Yi, Lingxiang Wu, Hanxiao Wu, Jinqiao Wang

Published: 01 Jan 2025, Last Modified: 25 Jan 2026 · IEEE Signal Processing Letters · CC BY-SA 4.0
Abstract: Vision-and-Language Navigation (VLN) tasks require an agent to navigate to a destination by following natural language instructions. We focus on a challenging VLN dataset, Aerial Vision-and-Dialog Navigation (AVDN), which spans a diverse array of environments and adds an altitude variable. Significant spatial and scale variations in the aerial agent's view make visual grounding of the destination a crucial capability for the navigation task. However, existing frameworks pay insufficient attention to the vision model and fail to exploit correlations between the visual and textual modalities. To address this, we propose a model that aligns destination images with navigation instructions, featuring three innovative components. First, a multi-stage pre-training pipeline strengthens the model's ability to associate language instructions with top-view images of destinations. Second, trajectories are elastically augmented to simulate noise in the control process. Third, a polygon regression loss for rotated object detection significantly improves the accuracy of altitude and orientation estimation. Experiments demonstrate the effectiveness of our approach, which achieves state-of-the-art results, with improvements of 2.9% on the val unseen split and 3.0% on the test unseen split in success weighted by path length (SPL).
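The abstract does not detail how the elastic trajectory augmentation works. As a rough illustration only, a minimal sketch of one plausible formulation, perturbing waypoints with temporally smoothed noise so the jitter resembles correlated controller error rather than independent white noise, might look like the following; the function name, tensor shapes, and parameter values are assumptions, not the authors' exact method:

```python
import numpy as np

def elastic_augment_trajectory(waypoints, sigma=0.5, smooth=3, seed=None):
    """Hypothetical sketch: jitter a trajectory with smooth random offsets
    to mimic noise in the low-level controlling process.

    waypoints: (T, 3) array of (x, y, altitude) positions.
    sigma:     magnitude of the per-axis displacement.
    smooth:    width of the moving-average window that correlates the noise
               across consecutive steps, making the perturbation "elastic".
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=waypoints.shape)
    kernel = np.ones(smooth) / smooth
    # Smooth the noise along the time axis, one coordinate at a time.
    for d in range(waypoints.shape[1]):
        noise[:, d] = np.convolve(noise[:, d], kernel, mode="same")
    return waypoints + noise
```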
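Likewise, the polygon regression loss is only named, not defined, in the abstract. A common formulation in rotated object detection regresses the four corner points of an oriented box and takes the minimum smooth-L1 distance over cyclic vertex orderings, so an arbitrary corner ordering is not penalized. The sketch below assumes that formulation (PyTorch, corner tensors of shape (N, 4, 2)); it is an illustrative stand-in, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def polygon_regression_loss(pred, target):
    """Hypothetical sketch of a polygon regression loss for rotated boxes.

    pred, target: (N, 4, 2) tensors of box corner coordinates.
    The loss is the minimum smooth-L1 distance over the four cyclic
    shifts of the target vertices, handling corner-order ambiguity.
    """
    losses = []
    for k in range(4):
        shifted = torch.roll(target, shifts=k, dims=1)  # cyclic corner reordering
        losses.append(F.smooth_l1_loss(pred, shifted, reduction="none").sum(dim=(1, 2)))
    return torch.stack(losses, dim=0).min(dim=0).values.mean()

# Usage: loss is differentiable w.r.t. the predicted corners.
pred = torch.randn(8, 4, 2, requires_grad=True)
target = torch.randn(8, 4, 2)
polygon_regression_loss(pred, target).backward()
```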