DAD-SFT: Dual Attention Distillation for Lightweight UAV Vision-Language Navigation

06 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: UAV-VLN, Lightweight Model, Knowledge Distillation, Contrastive Learning
TL;DR: We distill large vision-language models into lightweight agents via Dual Attention Distillation, achieving strong performance and even surpassing the teacher on CityNav.
Abstract: In recent years, Unmanned Aerial Vehicle (UAV) Vision-Language Navigation (VLN) has attracted increasing attention due to its broad applications in scenarios such as autonomous inspection and emergency rescue. Large-scale Vision-Language Models (VLMs) demonstrate strong cross-modal understanding and reasoning capabilities; however, their massive parameter counts and computational demands hinder deployment on resource-constrained devices. Although lightweight models enable efficient deployment, their performance and generalization remain limited. To address this challenge, we propose the Dual Attention Distillation into Supervised Fine-Tuning (DAD-SFT) framework. First, Cross-Modal Attention Distillation (CAD) guides the student model to align its semantic focus patterns with those of a powerful teacher model, enhancing its cross-modal perception. Second, we introduce Contrastive Attention Alignment (CAA), which constructs diverse types of negative samples to strengthen the model's discriminative capability and, in turn, improve generalization in complex scenarios. Systematic evaluations on the CityNav benchmark show that our method consistently outperforms mainstream baselines in navigation accuracy, cross-scene generalization, and deployment efficiency, demonstrating strong overall performance and practical potential. Our code is publicly available for reproducibility.
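The two losses described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes CAD is a KL-divergence match between the teacher's and student's cross-modal (text-to-image) attention maps, and that CAA is an InfoNCE-style contrastive loss over pooled attention features, with `negatives` standing in for the paper's "diverse types of negative samples". All function names, shapes, and the temperature value are hypothetical.

```python
import torch
import torch.nn.functional as F

def cad_loss(student_attn, teacher_attn, eps=1e-8):
    """Cross-Modal Attention Distillation (assumed form): KL(teacher || student)
    between cross-modal attention maps of shape (batch, text_tokens, image_tokens),
    where each row is a probability distribution over image tokens."""
    s = torch.clamp(student_attn, min=eps)
    t = torch.clamp(teacher_attn, min=eps)
    # Sum KL over image tokens, average over batch and text tokens.
    return (t * (t.log() - s.log())).sum(-1).mean()

def caa_loss(anchor, positive, negatives, temperature=0.07):
    """Contrastive Attention Alignment (assumed form): InfoNCE over pooled
    attention features. `anchor`/`positive` are (B, D); `negatives` is (B, K, D),
    e.g. features from perturbed instructions or mismatched views."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (anchor * positive).sum(-1, keepdim=True)      # (B, 1) similarity
    neg = torch.einsum('bd,bkd->bk', anchor, negatives)  # (B, K) similarities
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long)  # positive is index 0
    return F.cross_entropy(logits, labels)
```

In this sketch the total training objective would combine both terms with the standard supervised fine-tuning loss, e.g. `L = L_sft + a * cad_loss(...) + b * caa_loss(...)` for some weights `a`, `b` (hypothetical).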
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 2592