Enhancing Aerial Vision-Language Navigation with Map Grounding and History Awareness

Published: 02 Mar 2026 · Last Modified: 05 Mar 2026 · ES-Reasoning @ ICLR 2026 · License: CC BY 4.0
Keywords: Vision-Language Model, Aerial Navigation, Visual Language Navigation, Drone Navigation, Synthetic Data Generation, Action Imbalance
TL;DR: This work enhances aerial vision-language navigation by introducing history-aware map inputs, grouping actions to mitigate dataset imbalance, and generating synthetic training scenarios, nearly doubling the success rate on the CityNav unseen test set.
Abstract: Vision-Language Navigation (VLN) for urban UAVs is frequently hindered by ``landmark blindness'', where target landmarks are not visible from the agent's initial viewpoint. We address this by fine-tuning small Vision-Language Models using a ``Map-in-Pixel'' approach that interleaves 16 steps of egocentric visual frames with global geographic snapshots. To mitigate the data scarcity inherent in VLN datasets, we propose a synthetic augmentation strategy that generates diverse, causally consistent trajectories from randomized starting points. Through granular evaluation and targeted trajectory synthesis, we demonstrate that this history-rich training significantly improves the agent's ability to navigate toward distant objects. Our approach achieves a success rate of 12.5\% on the CityNav unseen test set, nearly doubling the baseline (6.4\%), while simultaneously reducing navigation error below baseline levels. This work underscores the efficacy of pixel-encoded maps, temporal history, and targeted data-centric design in empowering small-scale multimodal agents for long-horizon missions.
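
Since the abstract describes the ``Map-in-Pixel'' input only at a high level, the following is a minimal Python sketch of how a 16-step interleaved history of egocentric frames and global map snapshots could be assembled; all names here (MapInPixelHistory, render_map_snapshot, MAP_SIZE) are hypothetical illustrations under assumed conventions, not the paper's actual implementation.

```python
"""Sketch of a rolling ``Map-in-Pixel'' history buffer: keep the last 16
egocentric frames and interleave each with a pixel-rendered global map
snapshot, producing the image sequence a small VLM would consume."""

from collections import deque

import numpy as np

HISTORY_LEN = 16  # egocentric history length stated in the abstract
MAP_SIZE = 224    # assumed square resolution for the rendered map tile


def render_map_snapshot(pose_xy, goal_xy=None, size=MAP_SIZE):
    """Hypothetical map renderer: marks the agent (and goal, if known)
    as colored pixels on a blank global map canvas."""
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    x, y = (int(np.clip(c, 0, size - 1)) for c in pose_xy)
    canvas[y, x] = (255, 0, 0)            # agent position in red
    if goal_xy is not None:
        gx, gy = (int(np.clip(c, 0, size - 1)) for c in goal_xy)
        canvas[gy, gx] = (0, 255, 0)      # goal position in green
    return canvas


class MapInPixelHistory:
    """Rolling buffer that pairs each of the last HISTORY_LEN egocentric
    frames with a map snapshot and returns the interleaved sequence."""

    def __init__(self):
        self.steps = deque(maxlen=HISTORY_LEN)

    def observe(self, ego_frame, pose_xy):
        # The deque silently drops the oldest step beyond HISTORY_LEN.
        self.steps.append((ego_frame, pose_xy))

    def build_input(self, goal_xy=None):
        interleaved = []
        for frame, pose in self.steps:
            interleaved.append(frame)                              # ego view
            interleaved.append(render_map_snapshot(pose, goal_xy))  # map tile
        return interleaved


# Usage: feed one observation per step, then hand the sequence to the VLM.
history = MapInPixelHistory()
for t in range(20):  # more than 16 steps: only the latest 16 are kept
    fake_frame = np.zeros((MAP_SIZE, MAP_SIZE, 3), dtype=np.uint8)
    history.observe(fake_frame, pose_xy=(10 * t % MAP_SIZE, 5 * t % MAP_SIZE))
images = history.build_input(goal_xy=(200, 180))
assert len(images) == 2 * HISTORY_LEN
```

One convenient property of this layout is that the bounded deque keeps the VLM context length fixed for long-horizon missions: once a trajectory exceeds 16 steps, the oldest frame-map pair is dropped rather than growing the prompt.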
Submission Number: 74