## Main Results

DART improves Top-1 accuracy across a range of backbones on the ImageNet-1K dataset while keeping the added compute small.

### Performance on Transformers and SSMs

DART consistently improves Top-1 accuracy for DeiT, Vim, and VideoMamba models, by 0.8 to 1.6 points at the standard 196 patches. In long-sequence fine-tuning, it matches or exceeds baseline accuracy (+0.1 to +0.6 points) while using 35% to 69% fewer GFLOPs.

| Backbone                 | Tokenizer          | Params (M) | Patches          | GFLOPs                   | Top-1 (%)             |
| ------------------------ | ------------------ | ---------- | ---------------- | ------------------------ | --------------------- |
| *Transformers* |                    |            |                  |                          |                       |
| DeiT-Ti                  |                    | 6          | 196              | 1.26                     | 72.2                  |
| DeiT-Ti                  | **DART** | 7          | 196              | 1.32                     | 73.8 **(+1.6)** |
| DeiT-S                   |                    | 22         | 196              | 4.61                     | 79.8                  |
| DeiT-S                   | **DART** | 24         | 196              | 4.84                     | 80.6 **(+0.8)** |
| DeiT-S†                  |                    | 22         | 576              | 15.5                     | 81.6                  |
| DeiT-S†                  | **DART** | 24         | **392** | 10.1 **(-35%)** | 81.8 **(+0.2)** |
| *SSMs* |                    |            |                  |                          |                       |
| Vim-Ti                   |                    | 7          | 196              | 1.60                     | 76.1                  |
| Vim-Ti                   | **DART** | 8          | 196              | 1.68                     | 77.2 **(+1.1)** |
| Vim-S                    |                    | 26         | 196              | 5.30                     | 80.5                  |
| Vim-S                    | **DART** | 29         | 196              | 5.55                     | 81.5 **(+1.0)** |
| VideoMamba-Ti            |                    | 7          | 196              | 1.08                     | 76.9                  |
| VideoMamba-Ti            | **DART** | 8          | 196              | 1.15                     | 78.2 **(+1.3)** |
| Vim-Ti†                  |                    | 7          | 784              | 5.95                     | 78.3                  |
| Vim-Ti†                  | **DART** | 8          | **392** | 3.29 **(-45%)** | 78.9 **(+0.6)** |
| Vim-S†                   |                    | 26         | 784              | 19.6                     | 81.6                  |
| Vim-S†                   | **DART** | 29         | **392** | 10.9 **(-44%)** | 82.2 **(+0.6)** |
| VideoMamba-Ti†           |                    | 7          | 784              | 4.30                     | 79.3                  |
| VideoMamba-Ti†           |                    | 7          | 1296             | 7.11                     | 79.6                  |
| VideoMamba-Ti†           | **DART** | 8          | **392** | 2.24 **(-69%)** | 79.7 **(+0.1)** |

> † denotes long‐sequence fine‐tuning.
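
The percentage reductions in the GFLOPs column can be recomputed from the table itself. The short sketch below does so from the rounded values shown above; the helper name `flops_reduction` is illustrative and not part of the repository. Because the inputs are already rounded, results can differ from the table by up to a point, and the VideoMamba-Ti† figure appears to be taken relative to the 1296-patch baseline.

```python
def flops_reduction(baseline_gflops: float, dart_gflops: float) -> float:
    """Return the relative GFLOPs saving of DART over the baseline, in percent."""
    return 100.0 * (baseline_gflops - dart_gflops) / baseline_gflops


# (backbone, baseline GFLOPs, DART GFLOPs) copied from the long-sequence rows above.
ROWS = [
    ("DeiT-S", 15.5, 10.1),
    ("Vim-Ti", 5.95, 3.29),
    ("Vim-S", 19.6, 10.9),
    ("VideoMamba-Ti (vs. 1296 patches)", 7.11, 2.24),
]

for name, base, dart in ROWS:
    print(f"{name}: -{flops_reduction(base, dart):.1f}% GFLOPs")
```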

### Comparison with Dynamic Tokenizers

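DART compares favorably with other dynamic inference methods for ViT. At a matched 0.8 GFLOPs, DeiT-Ti + DART reaches 71.8% versus 71.0% for A-ViT-T, and DeiT-S + DART reaches 80.6% at 4.8 GFLOPs versus 80.3% at 7.0 GFLOPs for DynamicViT-S/0.5.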

| Model                 | Patches  | GFLOPs | Acc. (%) |
| --------------------- | -------- | ------ | -------- |
| A-ViT-T               | dynamic  | 0.8    | 71.0     |
| **DeiT-Ti + DART** | **121** | **0.8**| **71.8** |
| **DeiT-Ti + DART** | **196** | **1.3**| **73.8** |
| DeiT-S                | 196      | 4.61   | 79.8     |
| DynamicViT-S/0.5      | dynamic  | 7.0    | 80.3     |
| **DeiT-S + DART** | **196** | **4.8**| **80.6** |
| **DeiT-S + DART** | **392** | **10.1**| **81.8** |

### Ablation on Scoring Network

The choice of scoring backbone in DART is a trade-off: heavier scoring networks improve accuracy further but add parameters and compute.

| Scoring Network   | Params (M) | GFLOPs | Top-1 (%)       |
| ----------------- | ---------- | ----- | --------------- |
| w/o (DeiT-Ti baseline)    | 6          | 1.26  | 72.2            |
| MobileNetV3 Small | 7          | 1.32  | 73.8 **(+1.6)** |
| MnasNet           | 7          | 1.37  | 74.0 **(+1.8)** |
| SqueezeNet        | 7          | 1.54  | 74.3 **(+2.1)** |
| EfficientNet-B0   | 10         | 2.41  | 75.1 **(+2.9)** |
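
To make the trade-off concrete, the sketch below derives each scoring network's compute overhead and accuracy gain relative to the DeiT-Ti baseline, using only the numbers in the table above. The names `ABLATION` and `overhead` are illustrative and do not come from the repository.

```python
BASE_GFLOPS, BASE_TOP1 = 1.26, 72.2  # DeiT-Ti without a scoring network

# (scoring network, GFLOPs, Top-1 %) copied from the ablation table.
ABLATION = [
    ("MobileNetV3 Small", 1.32, 73.8),
    ("MnasNet", 1.37, 74.0),
    ("SqueezeNet", 1.54, 74.3),
    ("EfficientNet-B0", 2.41, 75.1),
]


def overhead(gflops: float, top1: float) -> tuple[float, float]:
    """Return (relative GFLOPs overhead in %, Top-1 gain in points) vs. the baseline."""
    return 100.0 * (gflops - BASE_GFLOPS) / BASE_GFLOPS, top1 - BASE_TOP1


for name, gflops, top1 in ABLATION:
    extra, gain = overhead(gflops, top1)
    print(f"{name}: +{extra:.0f}% GFLOPs for +{gain:.1f} points")
```

On these numbers, MobileNetV3 Small adds about 5% compute for +1.6 points, while EfficientNet-B0 adds about 91% compute for +2.9 points.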

---


## Installation

Setting up the environment requires a multi-step process, as several core components must be compiled from source. Please follow these steps in order.

### Step 1: Create Conda Environment

We recommend using a virtual environment.

```bash
conda create -n dart_env python=3.10
conda activate dart_env

# Install PyTorch (adjust for your CUDA version if necessary)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

### Step 2: Install Dependencies from Source

Manually install the dependencies below that apply to your setup before installing the packages from `requirements.txt`. Skipping a required one will result in errors.

- **NVIDIA Apex (Optional but Recommended):** For mixed-precision training.
  ```bash
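  # Follow the official guide for your system: https://github.com/NVIDIA/apex
  # Note: pip >= 23.1 deprecates --global-option; the Apex README documents a --config-settings equivalent.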
  git clone https://github.com/NVIDIA/apex
  cd apex
  pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
  cd ..
  ```

- **Mamba SSM:** This project's Mamba-based backbones (Vim, VideoMamba) rely on a custom-compiled version of `mamba_ssm`. Please follow the installation instructions provided in their official repositories to build this package.
  - For **Vim**: [https://github.com/hustvl/Vim](https://github.com/hustvl/Vim)
  - For **VideoMamba**: [https://github.com/OpenGVLab/VideoMamba](https://github.com/OpenGVLab/VideoMamba)


- **Swin Window Process (for Swin backbone experiments):** If you plan to run experiments on Swin Transformer, you need to build its custom CUDA operator.
  *Please refer to the official Swin Transformer repository for instructions on building this component.*

### Step 3: Install Remaining Dependencies

Once the special dependencies are installed, you can install all remaining packages using the provided `requirements.txt` file.

```bash
pip install -r requirements.txt
```
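
After these steps, a quick import check can confirm which components are visible to the interpreter. The following is a minimal sketch, not a script shipped with the repository; it assumes the packages import as `torch`, `torchvision`, `apex`, and `mamba_ssm`, and the function name `check_modules` is hypothetical.

```python
import importlib.util


def check_modules(names):
    """Map each top-level module name to True if it can be located, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    status = check_modules(["torch", "torchvision", "apex", "mamba_ssm"])
    for name, found in status.items():
        print(f"{name:12s} {'found' if found else 'MISSING'}")

    if status["torch"]:
        import torch  # imported lazily so the check also runs without PyTorch

        print("CUDA available:", torch.cuda.is_available())
```

A missing `apex` is expected if you skipped the optional Apex step, and `mamba_ssm` is only needed for the Vim and VideoMamba backbones.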

## Video Training

For video fine-tuning, we follow the VideoMamba training recipes. The only change is a new `--num_patches` flag, which controls the total number of dynamic patches. See `video/README.md` for a runnable example.

- Reference: [VideoMamba](https://github.com/OpenGVLab/VideoMamba)

## Training and Evaluation

The training and evaluation scripts are adapted from the DeiT repository.

### Training

Use the following command for multi-GPU training. Keep the hyperparameters consistent with those of the corresponding baseline model. The model name `darvit_sm` (passed via `--model`) corresponds to the "DeiT-S + DART" entry in our paper's tables.

```bash
python -m torch.distributed.launch --nproc_per_node=2 --master-port=29577 --use_env main.py \
--model darvit_sm \
--batch-size 256 \
--data-path /path/to/your/imagenet \
--data-set IMNET \
--output_dir /path/to/save/models \
--input-size 448
```
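
For orientation, the patch counts in the results tables are consistent with a non-overlapping 16×16 patch grid, so the input size determines how many tokens a static tokenizer would produce. The sketch below assumes that patch size (standard for DeiT, but an assumption here) and uses a hypothetical helper, `static_patch_count`; it does not call into the repository's code.

```python
def static_patch_count(input_size: int, patch_size: int = 16) -> int:
    """Number of tokens from a square image split into non-overlapping patches."""
    if input_size % patch_size != 0:
        raise ValueError("input_size must be a multiple of patch_size")
    per_side = input_size // patch_size
    return per_side * per_side


for size in (224, 384, 448, 576):
    print(size, static_patch_count(size))
```

Under this assumption, `--input-size 448` corresponds to a static grid of 784 tokens, which is twice the 392 patches listed for the DART long-sequence rows.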

### Evaluation

To evaluate a pretrained model, use the `--eval` flag.

```bash
python -m torch.distributed.launch --nproc_per_node=2 --master-port=29577 --use_env main.py \
--model darvit_sm \
--batch-size 256 \
--data-path /path/to/your/imagenet \
--data-set IMNET \
--resume /path/to/your/model.pth \
--input-size 448 \
--eval
```

## Integrating DART with New Models

The core logic of DART is encapsulated in the `dart/` directory, designed as a standard Python package. To integrate DART into a new vision model, you can replace the standard static patch embedding layer with the DART module. See `models_deit.py` for an example of how DART is integrated into the DeiT architecture.
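
The replacement step can be sketched generically as an attribute swap. The example below is illustrative only: it assumes the backbone exposes its patch embedding as an attribute named `patch_embed` (as timm-style ViTs do), the helper `swap_submodule` is hypothetical, and the stand-in objects are placeholders rather than the actual classes in `dart/`.

```python
from types import SimpleNamespace


def swap_submodule(model, dotted_name, new_module):
    """Replace the attribute at `dotted_name` on `model` and return the old value.

    Works on any Python object; on a torch.nn.Module, setattr also registers
    the new submodule.
    """
    *parents, leaf = dotted_name.split(".")
    owner = model
    for name in parents:
        owner = getattr(owner, name)
    old = getattr(owner, leaf)
    setattr(owner, leaf, new_module)
    return old


# Stand-ins for illustration only; a real model would be an nn.Module.
static_embed = SimpleNamespace(kind="static patch embedding")
dynamic_tokenizer = SimpleNamespace(kind="dynamic tokenizer (placeholder)")
model = SimpleNamespace(patch_embed=static_embed)

previous = swap_submodule(model, "patch_embed", dynamic_tokenizer)
print(previous.kind, "->", model.patch_embed.kind)
```

The constructor and forward signature of the actual DART module are defined in `dart/`, and `models_deit.py` remains the reference for how it is wired into a backbone.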
