# RegionViT: Regional-to-Local Attention for Vision Transformers

This repository is the official implementation of RegionViT: Regional-to-Local Attention for Vision Transformers.

## Requirements

To install the requirements:

```shell script
pip install -r requirements.txt
```

## Data preparation

Download and extract ImageNet train and val images from http://image-net.org/.
The directory structure follows the standard layout expected by torchvision's [`datasets.ImageFolder`](https://pytorch.org/docs/stable/torchvision/datasets.html#imagefolder): training images go in the `train/` folder and validation images in the `val/` folder, with one subfolder per class:

```
/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img3.jpeg
    class2/
      img4.jpeg
```
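Because `ImageFolder` assigns class indices by sorting the class-folder names, a `train/` and `val/` pair with different class folders would silently shift the labels. The sketch below checks for that using only the Python standard library. The function names `list_classes` and `check_imagenet_layout` are hypothetical and are not part of this repository.

```python
import os


def list_classes(split_dir):
    """Return class folder names, sorted the same way ImageFolder sorts them."""
    return sorted(entry.name for entry in os.scandir(split_dir) if entry.is_dir())


def check_imagenet_layout(root):
    """Hypothetical sanity check: train/ and val/ must hold identical class folders.

    Returns the sorted class list, or raises ValueError describing the mismatch.
    """
    train_classes = list_classes(os.path.join(root, "train"))
    val_classes = list_classes(os.path.join(root, "val"))
    if train_classes != val_classes:
        missing_in_val = sorted(set(train_classes) - set(val_classes))
        missing_in_train = sorted(set(val_classes) - set(train_classes))
        raise ValueError(
            f"class folders differ: missing in val={missing_in_val}, "
            f"missing in train={missing_in_train}"
        )
    return train_classes
```

For example, `len(check_imagenet_layout("/path/to/imagenet"))` should be 1000 for the ImageNet-1k release.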

## Training


To train RegionViT-S on ImageNet on a single node with 8 gpus for 300 epochs run:

```shell script
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model regionvit_small_224 --batch-size 256 --data-path /path/to/imagenet
```

The other available models are `regionvit_tiny_224`, `regionvit_medium_224`, and `regionvit_base_224`; pass any of them to `--model`.
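The single-node command above differs between models only in the `--model` value. The following sketch builds the same argument list for any of the listed model names. The helper `build_train_command` is hypothetical: it only reproduces the flags shown in the example command and does not cover every option `main.py` may accept.

```python
import shlex

# Model names listed in this README.
MODEL_NAMES = (
    "regionvit_tiny_224",
    "regionvit_small_224",
    "regionvit_medium_224",
    "regionvit_base_224",
)


def build_train_command(model, data_path, nproc_per_node=8, batch_size=256):
    """Hypothetical helper: return the single-node launch command as an argv list."""
    if model not in MODEL_NAMES:
        raise ValueError(f"unknown model {model!r}; expected one of {MODEL_NAMES}")
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}", "--use_env",
        "main.py",
        "--model", model,
        "--batch-size", str(batch_size),
        "--data-path", data_path,
    ]


# Prints the shell-quoted command for RegionViT-B with the same flags as above.
print(shlex.join(build_train_command("regionvit_base_224", "/path/to/imagenet")))
```

Returning an argv list rather than a single string lets the result be passed directly to `subprocess.run` without shell quoting issues.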

## Multi-node training

Distributed training across multiple nodes is available via Slurm and `submitit`.

To train the RegionViT-S model on ImageNet on 4 nodes with 8 GPUs each for 300 epochs, run:

```shell script
python run_with_submitit.py --model regionvit_small_224 --data-path /path/to/imagenet --batch-size 256 --warmup-epochs 50
```
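The README does not state whether `--batch-size` is a per-GPU or a global value. Assuming it is per GPU, and that no gradient accumulation is used (both assumptions), the number of samples per optimizer step is the product of nodes, GPUs per node, and per-GPU batch size. The hypothetical helper below works through that arithmetic for the two setups described above.

```python
def global_batch_size(nodes, gpus_per_node, per_gpu_batch_size):
    """Samples per optimizer step, ASSUMING --batch-size is per GPU
    and there is no gradient accumulation."""
    return nodes * gpus_per_node * per_gpu_batch_size


# Single node, 8 GPUs, --batch-size 256.
print(global_batch_size(1, 8, 256))  # 2048
# 4 nodes with 8 GPUs each, --batch-size 256.
print(global_batch_size(4, 8, 256))  # 8192
```

Under this assumption, the multi-node run processes four times as many samples per step as the single-node run.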
