Keywords: Vision-Language Model, Aerial Navigation, Visual Language Navigation, Drone Navigation, Synthetic Data Generation, Action Imbalance
TL;DR: By fine-tuning a small Vision-Language Model, correcting data issues such as action imbalance, and proposing synthetic data generation to counter overfitting, this work sets a new state-of-the-art 8% success rate on the CityNav aerial navigation benchmark.
Abstract: Visual Language Navigation (VLN) for autonomous robots presents a significant challenge, requiring models to ground textual instructions in visual environments. This paper addresses the CityNav aerial navigation benchmark by fine-tuning a small, open-source Vision-Language Model, Qwen2.5-VL-3B. Our investigation reveals that model performance is critically affected by a severe action imbalance in the training data and is substantially improved by incorporating recent flight trajectory history as an input. By addressing these factors, we achieve an 8% success rate on the Test Unseen split of CityNav, establishing a new state-of-the-art. Despite this result, we observe pronounced overfitting due to data scarcity. To mitigate this limitation, we propose a synthetic data generation strategy focused on explicitly teaching critical navigational skills, such as map interpretation. This work demonstrates that targeted, skill-based data synthesis is a promising direction for building more capable VLN agents.
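The abstract identifies action imbalance in the training data as a key factor. As a hedged illustration only (the paper's actual preprocessing is not described here, and the dataset schema, field names, and function below are assumptions), one common way to counter such imbalance is inverse-frequency oversampling of the rarer actions when assembling the fine-tuning set:

```python
# Illustrative sketch, not the paper's method: balance action classes by
# oversampling rare actions so each action is equally represented.
import random
from collections import Counter

def balance_actions(examples, key=lambda ex: ex["action"], seed=0):
    """Oversample rare-action examples up to the size of the most frequent class.

    `examples` is assumed to be a list of dicts with an "action" field;
    the field name and structure are hypothetical, not the paper's schema.
    """
    rng = random.Random(seed)
    counts = Counter(key(ex) for ex in examples)
    max_count = max(counts.values())
    balanced = []
    for action, count in counts.items():
        subset = [ex for ex in examples if key(ex) == action]
        # Repeat the subset, then top up with a random sample, to reach max_count.
        repeats, remainder = divmod(max_count, count)
        balanced.extend(subset * repeats)
        balanced.extend(rng.sample(subset, remainder))
    rng.shuffle(balanced)
    return balanced

# Toy usage: a heavily imbalanced set of navigation steps.
toy = ([{"action": "move_forward"}] * 90
       + [{"action": "turn_left"}] * 7
       + [{"action": "stop"}] * 3)
print(Counter(ex["action"] for ex in balance_actions(toy)))  # roughly equal counts
```

Whether to oversample, downsample, or reweight the loss is a design choice; oversampling is shown here only because it keeps the sketch short and dataset-level.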
Submission Number: 68