# CrossFormer

This repository contains the code for our paper *CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention*.




## Introduction

Existing vision transformers fail to build attention among objects/features of different scales (cross-scale attention), even though this ability is very important for visual tasks. **CrossFormer** is a versatile vision transformer that solves this problem. Its core designs are the **C**ross-scale **E**mbedding **L**ayer (**CEL**) and **L**ong-**S**hort **D**istance **A**ttention (**L/SDA**), which work together to enable cross-scale attention.

**CEL** blends every input embedding with multi-scale features. **L/SDA** splits all embeddings into several groups, and self-attention is computed only within each group (in the figure below, embeddings with the same border color belong to the same group).

![](./figures/github_pic.png)
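
As a rough illustration of the two ideas above, the NumPy sketch below builds cross-scale embeddings by sampling patches of several sizes around the same centers and concatenating their projections, then partitions the resulting grid into attention groups. Everything in it is an assumption made for illustration rather than this repository's implementation: the function names, the kernel sizes `(4, 8)`, the stride, the random projection weights standing in for learned ones, and the two grouping rules (adjacent `G x G` blocks for short-distance attention, fixed-interval sampling for long-distance attention).

```python
import numpy as np


def cross_scale_embed(image, kernel_sizes=(4, 8), stride=4, dims=(8, 8), seed=0):
    """Embed an (H, W, C) image with several patch sizes sharing one stride.

    Assumes every kernel size is >= stride and (kernel - stride) is even,
    so that all scales produce the same (H / stride, W / stride) grid.
    """
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    out_h, out_w = h // stride, w // stride
    features = []
    for k, d in zip(kernel_sizes, dims):
        pad = (k - stride) // 2
        padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
        # Random projection: a stand-in for a learned k x k convolution kernel.
        weight = rng.standard_normal((k * k * c, d)) / np.sqrt(k * k * c)
        patches = np.stack([
            padded[i * stride:i * stride + k, j * stride:j * stride + k].reshape(-1)
            for i in range(out_h) for j in range(out_w)
        ])
        features.append((patches @ weight).reshape(out_h, out_w, d))
    # Each embedding now mixes information from every patch size.
    return np.concatenate(features, axis=-1)


def sda_groups(x, group_size):
    """Short-distance grouping: adjacent group_size x group_size embeddings."""
    h, w, c = x.shape
    g = group_size
    x = x.reshape(h // g, g, w // g, g, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, g * g, c)


def lda_groups(x, interval):
    """Long-distance grouping: embeddings sampled at a fixed interval."""
    h, w, c = x.shape
    i = interval
    x = x.reshape(h // i, i, w // i, i, c).transpose(1, 3, 0, 2, 4)
    return x.reshape(-1, (h // i) * (w // i), c)


embeddings = cross_scale_embed(np.ones((32, 32, 3)))  # shape (8, 8, 16)
short = sda_groups(embeddings, group_size=4)           # shape (4, 16, 16)
long_ = lda_groups(embeddings, interval=4)             # shape (16, 4, 16)
```

Self-attention would then run independently for each group (the first axis), so its cost depends on the group size rather than on the total number of embeddings.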

We also propose a dynamic position bias (DPB) module, which lets the effective but otherwise inflexible relative position bias be applied to variable image sizes.
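
One way to picture why a computed bias handles variable sizes is sketched below. It assumes the bias is produced by a small network that maps a relative offset `(dx, dy)` to a scalar, so the same weights can fill a bias matrix for any group size. The two-layer MLP, its hidden width, the random weights, and the names `make_dpb_weights` and `dynamic_position_bias` are hypothetical simplifications, not the module shipped in this repository.

```python
import numpy as np


def make_dpb_weights(hidden_dim=8, seed=0):
    """Random stand-ins for the learned weights of a tiny offset -> bias MLP."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.standard_normal((2, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.standard_normal((hidden_dim, 1)),
    }


def dynamic_position_bias(weights, group_size):
    """Return a (G*G, G*G) bias matrix for a G x G group of embeddings."""
    g = group_size
    # (row, col) coordinate of every embedding in the group, in row-major order.
    grid = np.meshgrid(np.arange(g), np.arange(g), indexing="ij")
    coords = np.stack(grid, axis=-1).reshape(-1, 2)
    # Relative offset between every pair of embeddings: shape (G*G, G*G, 2).
    offsets = coords[:, None, :] - coords[None, :, :]
    hidden = np.maximum(offsets @ weights["w1"] + weights["b1"], 0.0)  # ReLU
    return (hidden @ weights["w2"])[..., 0]


weights = make_dpb_weights()
bias_7 = dynamic_position_bias(weights, 7)    # shape (49, 49)
bias_14 = dynamic_position_bias(weights, 14)  # shape (196, 196)
```

Because the bias depends only on the offset, pairs with the same offset share a value across group sizes, whereas a fixed lookup table would have to be sized for one particular group size.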

Experiments cover four representative visual tasks: image classification, object detection, and instance/semantic segmentation. The results show that CrossFormer outperforms existing vision transformers on these tasks, especially on dense prediction tasks (*i.e.*, object detection and instance/semantic segmentation). We think this is because image classification attends to only one object and large-scale features, while dense prediction tasks rely more on cross-scale attention.



## Prerequisites

1. Libraries (Python3.6-based)
```bash
pip3 install numpy scipy Pillow pyyaml torch==1.7.0 torchvision==0.8.1 timm==0.3.2
```
2. Dataset: ImageNet

3. Requirements for detection/instance segmentation and semantic segmentation are listed in [detection/README.md](./detection/README.md) and [segmentation/README.md](./segmentation/README.md), respectively.



## Getting Started

### Training
```bash
## There should be two directories under the path_to_imagenet: train and validation

## CrossFormer-T
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/tiny_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output

## CrossFormer-S
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/small_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output

## CrossFormer-B
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/base_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output

## CrossFormer-L
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/large_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output
```
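
Before launching a multi-GPU job, it can help to confirm the dataset layout mentioned in the comment above. The helper below is a hypothetical convenience, not part of this repository; it only checks that the `train` and `validation` directories exist under the given path.

```python
from pathlib import Path


def missing_imagenet_dirs(root):
    """Return the expected sub-directories that are absent under root."""
    root = Path(root)
    return [name for name in ("train", "validation") if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_imagenet_dirs("path_to_imagenet")
    if missing:
        print("Missing directories:", ", ".join(missing))
```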

### Testing
```bash
## Take CrossFormer-T as an example
python -u -m torch.distributed.launch --nproc_per_node 1 main.py --cfg configs/tiny_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --eval --resume path_to_crossformer-t.pth
```

Training scripts for object detection: [detection/README.md](./detection/README.md).

Training scripts for semantic segmentation: [segmentation/README.md](./segmentation/README.md).



## Results

### Image Classification

Models trained on ImageNet-1K and evaluated on its validation set. The input image size is 224 x 224.

| Architectures | Params | FLOPs | Accuracy | Models |
| ------------- | ------: | -----: | --------: | :---------------- |
| ResNet-50 | 25.6M | 4.1G | 76.2% |      -        |
| RegNetY-8G | 39.0M | 8.0G | 81.7% |     -        |
| **CrossFormer-T** | **27.8M**  | **2.9G**  | **81.5%**    | - |
| **CrossFormer-S** | **30.7M**  | **4.9G**  | **82.5%**    | - |
| **CrossFormer-B** | **52.0M**  | **9.2G**  | **83.4%**    | - |
| **CrossFormer-L** | **92.0M**  | **16.1G** | **84.0%**    | - |

More results compared with other vision transformers can be seen in the paper.

### Object Detection & Instance Segmentation

Models trained on COCO 2017. Backbones are initialized with weights pre-trained on ImageNet-1K.

| Backbone      | Detection Head | Learning Schedule | Params | FLOPs  | box AP | mask AP |
| ------------- | ----------------- | -------------------- | ------: | ------: | ------: | ------: |
| ResNet-101 | RetinaNet | 1x | 56.7M | 315.0G | 38.5 | - |
| **CrossFormer-S** | RetinaNet         | 1x                   | **40.8M**  | **282.0G** | **44.4**   | -      |
| **CrossFormer-B** | RetinaNet         | 1x                   | **62.1M**  | **389.0G** | **46.2**   | -      |
| ResNet-101 | Mask-RCNN | 1x | 63.2M | 336.0G | 40.4 | 36.4 |
| **CrossFormer-S** | Mask-RCNN        | 1x                   | **50.2M**  | **301.0G** | **45.4**   | **41.4** |
| **CrossFormer-B** | Mask-RCNN         | 1x                   | **71.5M**  | **407.9G** | **47.2**   | **42.7** |
| **CrossFormer-S** | Mask-RCNN        | 3x                   | **50.2M**  | **291.1G** | **48.7**   | **43.9** |
| **CrossFormer-B** | Mask-RCNN         | 3x                   | **71.5M**  | **398.1G** | **49.8**   | **44.5** |
| **CrossFormer-S** | Cascade-Mask-RCNN | 3x                   | **88.0M**  | **769.7G** | **52.2**   | **45.2** |

More results and pretrained models for object detection: [detection/README.md](./detection/README.md).

### Semantic Segmentation

Models trained on ADE20K. Backbones are initialized with weights pre-trained on ImageNet-1K.

| Backbone      | Segmentation Head | Iterations | Params | FLOPs   | IOU  | MS IOU |
| ------------- | -------------------- | ----------: | ------: | -------: | ----: | ------: |
| **CrossFormer-S** | FPN                  | 80K       | **34.3M**  | **209.8G**  | **46.4** | -      |
| **CrossFormer-B** | FPN                  | 80K       | **55.6M**  | **320.1G**  | **48.0** | -      |
| **CrossFormer-L** | FPN                  | 80K       | **95.4M**  | **482.7G**  | **49.1** | -      |
| ResNet-101 | UPerNet | 160K | 86.0M | 1029.0G | 44.9 | - |
| **CrossFormer-S** | UPerNet              | 160K       | **62.3M**  | **979.5G**  | **47.6** | **48.4** |
| **CrossFormer-B** | UPerNet              | 160K       | **83.6M**  | **1089.7G** | **49.7** | **50.6** |
| **CrossFormer-L** | UPerNet              | 160K       | **125.5M** | **1257.8G** | **50.4** | **51.4** |

*MS IOU means IOU with multi-scale testing.*

More results and pretrained models for semantic segmentation: [segmentation/README.md](./segmentation/README.md).
