Highlights
• We improve the data efficiency of ViT on small datasets.
• Our method incorporates multi-scale tokens within the global self-attention of ViT.
• Our approach enables regional cross-scale interaction through multi-scale fusion.
• We introduce a novel data augmentation schedule in the training phase.
• Experiments demonstrate the superior performance and data efficiency of our method.