DAD-SFT: Dual Attention Distillation for Lightweight UAV Vision-Language Navigation

06 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: UAV-VLN, Lightweight Model, Knowledge Distillation, Contrastive Learning
TL;DR: We distill large vision-language models into lightweight agents via Dual Attention Distillation, achieving strong performance and even surpassing the teacher on CityNav.
Abstract: In recent years, Unmanned Aerial Vehicle (UAV) Vision-Language Navigation (VLN) has attracted increasing attention due to its broad applications in scenarios such as autonomous inspection and emergency rescue. Large-scale Vision-Language Models (VLMs) demonstrate strong cross-modal understanding and reasoning capabilities; however, their massive parameter counts and computational demands hinder deployment on resource-constrained devices. Although lightweight models enable efficient deployment, their performance and generalization remain limited. To address this challenge, we propose the Dual Attention Distillation into Supervised Fine-Tuning (DAD-SFT) framework. First, Cross-Modal Attention Distillation (CAD) guides the student model to align its semantic focus patterns with those of a powerful teacher model, enhancing its cross-modal perception. Second, we introduce Contrastive Attention Alignment (CAA), which constructs diverse types of negative samples to strengthen the model's discriminative capability and, in turn, improve generalization in complex scenarios. Systematic evaluations on the CityNav benchmark show that our method consistently outperforms mainstream baselines in navigation accuracy, cross-scene generalization, and deployment efficiency, demonstrating strong overall performance and practical potential. Our code is publicly available for reproducibility.
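The two losses described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes CAD is a KL-divergence match between the teacher's and student's cross-modal (text-to-image) attention maps, and that CAA is an InfoNCE-style contrastive loss over pooled attention features, with `negatives` standing in for the paper's "diverse types of negative samples". All function names, shapes, and the temperature value are hypothetical.

```python
import torch
import torch.nn.functional as F

def cad_loss(student_attn, teacher_attn, eps=1e-8):
    """Cross-Modal Attention Distillation (assumed form): KL(teacher || student)
    between cross-modal attention maps of shape (batch, text_tokens, image_tokens),
    where each row is a probability distribution over image tokens."""
    s = torch.clamp(student_attn, min=eps)
    t = torch.clamp(teacher_attn, min=eps)
    # Sum KL over image tokens, average over batch and text tokens.
    return (t * (t.log() - s.log())).sum(-1).mean()

def caa_loss(anchor, positive, negatives, temperature=0.07):
    """Contrastive Attention Alignment (assumed form): InfoNCE over pooled
    attention features. `anchor`/`positive` are (B, D); `negatives` is (B, K, D),
    e.g. features from perturbed instructions or mismatched views."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (anchor * positive).sum(-1, keepdim=True)      # (B, 1) similarity
    neg = torch.einsum('bd,bkd->bk', anchor, negatives)  # (B, K) similarities
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long)  # positive is index 0
    return F.cross_entropy(logits, labels)
```

In this sketch the total training objective would combine both terms with the standard supervised fine-tuning loss, e.g. `L = L_sft + a * cad_loss(...) + b * caa_loss(...)` for some weights `a`, `b` (hypothetical).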
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 2592