Keywords: Vision-Language Model, Aerial Navigation, Visual Language Navigation, Drone Navigation, Synthetic Data Generation, Action Imbalance
TL;DR: By fine-tuning a small Vision-Language Model, correcting data issues such as action imbalance, and proposing synthetic data generation to counter overfitting, this work sets a new state-of-the-art 8% success rate on the CityNav aerial navigation benchmark.
Abstract: Visual Language Navigation (VLN) for autonomous robots presents a significant challenge, requiring models to ground textual instructions in visual environments. This paper addresses the CityNav aerial navigation benchmark by fine-tuning a small, open-source Vision-Language Model, Qwen2.5-VL-3B. Our investigation reveals that model performance is critically affected by a severe action imbalance in the training data and is substantially improved by incorporating recent flight trajectory history as an input. By addressing these factors, we achieve an 8% success rate on the Test Unseen split of CityNav, establishing a new state-of-the-art. Despite this result, we observe pronounced overfitting due to data scarcity. To mitigate this limitation, we propose a synthetic data generation strategy focused on explicitly teaching critical navigational skills, such as map interpretation. This work demonstrates that targeted, skill-based data synthesis is a promising direction for building more capable VLN agents.
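The abstract identifies action imbalance in the training data as a key factor. As a hedged illustration only (the paper's actual preprocessing is not described here, and the dataset schema, field names, and function below are assumptions), one common way to counter such imbalance is inverse-frequency oversampling of the rarer actions when assembling the fine-tuning set:

```python
# Illustrative sketch, not the paper's method: balance action classes by
# oversampling rare actions so each action is equally represented.
import random
from collections import Counter

def balance_actions(examples, key=lambda ex: ex["action"], seed=0):
    """Oversample rare-action examples up to the size of the most frequent class.

    `examples` is assumed to be a list of dicts with an "action" field;
    the field name and structure are hypothetical, not the paper's schema.
    """
    rng = random.Random(seed)
    counts = Counter(key(ex) for ex in examples)
    max_count = max(counts.values())
    balanced = []
    for action, count in counts.items():
        subset = [ex for ex in examples if key(ex) == action]
        # Repeat the subset, then top up with a random sample, to reach max_count.
        repeats, remainder = divmod(max_count, count)
        balanced.extend(subset * repeats)
        balanced.extend(rng.sample(subset, remainder))
    rng.shuffle(balanced)
    return balanced

# Toy usage: a heavily imbalanced set of navigation steps.
toy = ([{"action": "move_forward"}] * 90
       + [{"action": "turn_left"}] * 7
       + [{"action": "stop"}] * 3)
print(Counter(ex["action"] for ex in balance_actions(toy)))  # roughly equal counts
```

Whether to oversample, downsample, or reweight the loss is a design choice; oversampling is shown here only because it keeps the sketch short and dataset-level.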
Submission Number: 68