
# <i>Dustin:</i> Draft-Augmented Sparse Verification for Efficient Long-Context Generation with Speculative Decoding

<br>

## 0. Environment Setup
### 0.1 Create Conda Virtual Environment
Create and activate a conda virtual environment with Python 3.11:
```bash
conda create -n dustin python=3.11
conda activate dustin
```

### 0.2 Install PyTorch 2.8.0
Install PyTorch with CUDA 12.6 support:
```bash
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu126
```

### 0.3 Install Flash Attention
Install flash-attn and its dependencies:
```bash
pip install psutil
pip install flash_attn --no-build-isolation
```

### 0.4 Install Other Dependencies
Install remaining requirements:
```bash
pip install -r requirements.txt
```
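
After installation, a quick sanity check can confirm that the pinned versions from section 0.2 are the ones actually installed. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `check_environment` are hypothetical, and it assumes that any installed flash-attn version is acceptable since this README does not pin one.

```python
from importlib import metadata

# Versions pinned in section 0.2; None means "any installed version is accepted".
EXPECTED = {
    "torch": "2.8.0",
    "torchvision": "0.23.0",
    "torchaudio": "2.8.0",
    "flash-attn": None,  # assumption: no specific version is required
}


def installed_version(package):
    """Return the installed version without a local suffix such as '+cu126', or None."""
    try:
        return metadata.version(package).split("+")[0]
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected=EXPECTED, lookup=installed_version):
    """Map each package name to a (found_version, ok) pair."""
    report = {}
    for package, wanted in expected.items():
        found = lookup(package)
        ok = found is not None and (wanted is None or found == wanted)
        report[package] = (found, ok)
    return report


if __name__ == "__main__":
    for package, (found, ok) in check_environment().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{package:12s} {found or 'not installed':16s} {status}")
```

Run it inside the activated `dustin` environment; any `MISMATCH` line points to a package that should be reinstalled with the commands above.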

## 1. LongBench v1 Benchmark
> **⚠️ Configuration Note:** Before running the LongBench benchmarks, comment out the following lines in each pipeline file (`run/vanilla_mb.py`, `run/classic_seq_sd_mb.py`, `run/targetkv_seq_sd_mb.py`):
> ```python
> # For pg-19
> # self.limit_min_output = True
> # self.min_length = 1024 * 16
> # self.max_new_tokens = 512
> ```
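
Because the same three lines must be toggled in three files, it is easy to leave one file in the wrong state. The sketch below is a hypothetical pre-flight check, not part of the repository. It assumes that each pipeline file contains the `# For pg-19` marker followed immediately by the three settings in the order shown above; the function names are made up for illustration.

```python
import re
from pathlib import Path

# File names and setting names are taken from the configuration note above.
PIPELINE_FILES = [
    "run/vanilla_mb.py",
    "run/classic_seq_sd_mb.py",
    "run/targetkv_seq_sd_mb.py",
]
PG19_SETTINGS = ["limit_min_output", "min_length", "max_new_tokens"]


def pg19_block_state(source):
    """Return 'active', 'commented', 'mixed', or 'missing' for the pg-19 block."""
    lines = source.splitlines()
    for i, line in enumerate(lines):
        if line.strip().lower() != "# for pg-19":
            continue
        # Assumption: the three settings directly follow the marker line.
        block = [text.strip() for text in lines[i + 1 : i + 1 + len(PG19_SETTINGS)]]
        if len(block) < len(PG19_SETTINGS):
            return "missing"
        flags = []
        for name, text in zip(PG19_SETTINGS, block):
            match = re.match(rf"(#\s*)?self\.{name}\s*=", text)
            if match is None:
                return "missing"
            flags.append(match.group(1) is None)  # True when not commented out
        if all(flags):
            return "active"
        if not any(flags):
            return "commented"
        return "mixed"
    return "missing"


def check_pipelines(repo_root="."):
    """Return {relative_path: state}; files that do not exist map to 'not found'."""
    states = {}
    for rel_path in PIPELINE_FILES:
        path = Path(repo_root) / rel_path
        if path.is_file():
            states[rel_path] = pg19_block_state(path.read_text(encoding="utf-8"))
        else:
            states[rel_path] = "not found"
    return states
```

Calling `check_pipelines()` from the repository root should report `commented` for all three files before a LongBench run, and `active` before the PG19 run in section 2.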

### 1.1 TargetKV-SpecDec
1. Single-doc QA benchmark
```bash
bash run.sh run.targetkv_seq_sd_mb run-benchmark-acc --benchmarks narrativeqa,qasper --max-samples 200
bash run.sh run.targetkv_seq_sd_mb run-benchmark-acc --benchmarks multifieldqa_en --max-samples 150
```
2. Multi-doc QA benchmark
```bash
bash run.sh run.targetkv_seq_sd_mb run-benchmark-acc --benchmarks hotpotqa,2wikimqa,musique --max-samples 200
```
3. Summarization benchmark
```bash
bash run.sh run.targetkv_seq_sd_mb run-benchmark-acc --benchmarks gov_report,qmsum,multi_news --max-samples 200
```
4. Few-shot Learning benchmark
```bash
bash run.sh run.targetkv_seq_sd_mb run-benchmark-acc --benchmarks trec,triviaqa,samsum --max-samples 200
```
5. Synthetic benchmark
```bash
bash run.sh run.targetkv_seq_sd_mb run-benchmark-acc --benchmarks passage_count,passage_retrieval_en --max-samples 200
```
6. Code benchmark
```bash
bash run.sh run.targetkv_seq_sd_mb run-benchmark-acc --benchmarks lcc,repobench_p --max-samples 500
```

### 1.2 Vanilla (Auto-Regressive)
1. Single-doc QA benchmark
```bash
bash run.sh run.vanilla_mb run-benchmark-acc --benchmarks narrativeqa,qasper --max-samples 200
bash run.sh run.vanilla_mb run-benchmark-acc --benchmarks multifieldqa_en --max-samples 150
```
2. Multi-doc QA benchmark
```bash
bash run.sh run.vanilla_mb run-benchmark-acc --benchmarks hotpotqa,2wikimqa,musique --max-samples 200
```
3. Summarization benchmark
```bash
bash run.sh run.vanilla_mb run-benchmark-acc --benchmarks gov_report,qmsum,multi_news --max-samples 200
```
4. Few-shot Learning benchmark
```bash
bash run.sh run.vanilla_mb run-benchmark-acc --benchmarks trec,triviaqa,samsum --max-samples 200
```
5. Synthetic benchmark
```bash
bash run.sh run.vanilla_mb run-benchmark-acc --benchmarks passage_count,passage_retrieval_en --max-samples 200
```
6. Code benchmark
```bash
bash run.sh run.vanilla_mb run-benchmark-acc --benchmarks lcc,repobench_p --max-samples 500
```

### 1.3 Classic-SpecDec
1. Single-doc QA benchmark
```bash
bash run.sh run.classic_seq_sd_mb run-benchmark-acc --benchmarks narrativeqa,qasper --max-samples 200
bash run.sh run.classic_seq_sd_mb run-benchmark-acc --benchmarks multifieldqa_en --max-samples 150
```
2. Multi-doc QA benchmark
```bash
bash run.sh run.classic_seq_sd_mb run-benchmark-acc --benchmarks hotpotqa,2wikimqa,musique --max-samples 200
```
3. Summarization benchmark
```bash
bash run.sh run.classic_seq_sd_mb run-benchmark-acc --benchmarks gov_report,qmsum,multi_news --max-samples 200
```
4. Few-shot Learning benchmark
```bash
bash run.sh run.classic_seq_sd_mb run-benchmark-acc --benchmarks trec,triviaqa,samsum --max-samples 200
```
5. Synthetic benchmark
```bash
bash run.sh run.classic_seq_sd_mb run-benchmark-acc --benchmarks passage_count,passage_retrieval_en --max-samples 200
```
6. Code benchmark
```bash
bash run.sh run.classic_seq_sd_mb run-benchmark-acc --benchmarks lcc,repobench_p --max-samples 500
```
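
The commands in sections 1.1 to 1.3 differ only in the pipeline module, so they can be generated rather than typed by hand. The following is a small sketch of such a driver, assuming it is launched from the repository root where `run.sh` lives. The benchmark groups and sample caps are copied from the commands above; `build_commands` and `run_all` are hypothetical helper names, not part of the repository.

```python
import shlex
import subprocess

# Benchmark groups and --max-samples values copied from the commands in this section.
LONGBENCH_GROUPS = [
    ("narrativeqa,qasper", 200),
    ("multifieldqa_en", 150),
    ("hotpotqa,2wikimqa,musique", 200),
    ("gov_report,qmsum,multi_news", 200),
    ("trec,triviaqa,samsum", 200),
    ("passage_count,passage_retrieval_en", 200),
    ("lcc,repobench_p", 500),
]
PIPELINES = ["run.targetkv_seq_sd_mb", "run.vanilla_mb", "run.classic_seq_sd_mb"]


def build_commands(pipeline):
    """Return one argument list per benchmark group for the given pipeline module."""
    return [
        ["bash", "run.sh", pipeline, "run-benchmark-acc",
         "--benchmarks", benchmarks, "--max-samples", str(max_samples)]
        for benchmarks, max_samples in LONGBENCH_GROUPS
    ]


def run_all(pipelines=PIPELINES, dry_run=True):
    """Print every command; execute them in order only when dry_run is False."""
    for pipeline in pipelines:
        for command in build_commands(pipeline):
            print(shlex.join(command))
            if not dry_run:
                subprocess.run(command, check=True)  # stop on the first failure


if __name__ == "__main__":
    run_all(dry_run=True)
```

With `dry_run=True` the script only prints the 21 commands, which is a cheap way to review them before switching to `dry_run=False` for the actual runs.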

## 2. Efficiency Evaluation (PG19)
> **⚠️ Configuration Note:** Before running the PG19 benchmark, uncomment the following lines in each pipeline file (`run/vanilla_mb.py`, `run/classic_seq_sd_mb.py`, `run/targetkv_seq_sd_mb.py`):
> ```python
> # For pg-19
> self.limit_min_output = True
> self.min_length = 1024 * 16
> self.max_new_tokens = 512
> ```

### 2.1 TargetKV-SpecDec
```bash
bash run.sh run.targetkv_seq_sd_mb run-benchmark-acc --benchmarks pg19 --max-samples 32
```

### 2.2 Vanilla (Auto-Regressive)
```bash
bash run.sh run.vanilla_mb run-benchmark-acc --benchmarks pg19 --max-samples 32
```

### 2.3 Classic-SpecDec
```bash
bash run.sh run.classic_seq_sd_mb run-benchmark-acc --benchmarks pg19 --max-samples 32
```
