# 🚀 DAD-SFT: Dual Attention Distillation for Lightweight UAV Vision-Language Navigation

## 📑 Introduction

DAD-SFT is a novel ​​Supervised Fine-Tuning​​ framework that enables lightweight UAV Vision-Language Navigation (VLN) models for efficient deployment on resource-constrained devices. This method leverages Cross-Modal Attention Distillation (CAD) to guide the student model in learning the teacher model’s semantic focus patterns, and incorporates Contrastive Attention Alignment (CAA) to enhance the discriminative ability of the model using positive and negative samples. Through the synergy of perceptual transfer and discriminative optimization, DAD-SFT significantly improves the cross-modal understanding and generalization ability of lightweight models, enabling efficient and robust navigation on resource-limited devices. Systematic evaluations on the CityNav benchmark demonstrate that our method consistently outperforms mainstream baselines in terms of navigation accuracy, cross-scene generalization, and deployment efficiency, showcasing strong overall performance and practical potential.

## 🛠️ Environment Setup

This project depends on multiple models and tool libraries. It is recommended to use Conda to create an isolated environment.

### Install Conda Environment

```bash
- conda create -n DAD_SFT python=3.10
- conda activate DAD_SFT
- pip install -r requirements.txt
```

---

## 🛠️ Model and Data Preparation

* Download model weights to `./model/`  

* Download data to `./dataset/`

### 📦 Project Structure
```bash
├── dataset/               # Training and evaluation data
│   ├── sftdatabbox.json   # Main sft training data file
│   ├── sftdatabox/        # Image directories for main training data
│   └── test_data/         # Test data for evaluation
├── model/                 # Model weights directory (download manually)
├── src/                   
│   ├── gsamllavanav/      # Navigation and mapping components
│   ├── navgym/            # Navigation gym environment
│   └── uav_vln/           # Core VL model implementation
├── scripts/               # Additional scripts
├── eval.py                # Evaluation script
├── train.py               # Training script
└── train.sh               # Training shell script
```
---

## 🚀 Inference

1. Start the vLLM service
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve path/to/your/model \
  --dtype auto \
  --trust-remote-code \
  --served-model-name qwen_2_5_vl_3b \
  --host 0.0.0.0 \
  -tp 4 \
  --uvicorn-log-level debug \
  --port your_port \
  --limit-mm-per-prompt image=2,video=0 \
  --max-model-len=32000
```

2. Start the inference script

```bash
python eval.py
```

3. Result Visualization  
You can use the visualize_prediction function to visualize the predicted target coordinates and the landmark bounding boxes, as well as the actual target coordinates and landmark bounding boxes.

---

## 🚀 Training

1. Generate training data
```bash
python gen_teacher_attn.py
python gen_neg_attn.py
```

2. Supervised Fine-Tuning (SFT)
```bash
sh train.sh
```

---

