SViT: Hybrid Vision Transformer Models with Scattering Transform

Published: 01 Jan 2022 · Last Modified: 12 May 2023 · MLSP 2022
Abstract: Transformer architectures not only achieve strong performance on natural language processing tasks, but also demonstrate feature extraction ability comparable to convolutional neural networks on computer vision tasks. Unlike words in a sentence, however, it is difficult to specify image tokens with definite semantic representations, which diminishes the impact of the self-attention mechanism. This work investigates tokenization methods for the Vision Transformer (ViT). The scattering transform has a compact network structure based on fixed wavelet filters and can provide adequate spatial and frequency-based tokenizations. We therefore propose hybrid ViT models with the scattering transform, called Scattering Vision Transformer (SViT). Experiments on image classification tasks suggest that scattering-transform-based tokenizations can improve the performance of vanilla ViT, especially on small datasets or frequency-sensitive tasks such as satellite image classification. Our code is available at https://github.com/TianmingQiu/scattering_transformer.
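To make the tokenization idea concrete, below is a minimal sketch of how a scattering-based tokenizer could feed a vanilla Transformer encoder. It assumes the kymatio library for the 2D scattering transform and PyTorch for the model; the class names (`ScatteringTokenizer`, `SViTSketch`) and all hyperparameters are illustrative choices, not the architecture from the paper, for which the linked repository is the reference.

```python
# Sketch only: a scattering-based tokenizer for a ViT-style classifier.
# Assumes kymatio (pip install kymatio) and PyTorch. Hyperparameters and
# class names are hypothetical, not taken from the SViT paper.
import torch
import torch.nn as nn
from kymatio.torch import Scattering2D

class ScatteringTokenizer(nn.Module):
    """Turn an image into a token sequence via a fixed 2D scattering transform."""
    def __init__(self, img_size=32, in_channels=3, J=2, embed_dim=192):
        super().__init__()
        # Fixed wavelet filters: the scattering stage has no learned parameters.
        self.scattering = Scattering2D(J=J, shape=(img_size, img_size))
        # With L=8 orientations (kymatio default), the number of scattering
        # coefficients per input channel is 1 + L*J + L^2 * J*(J-1)/2 = 81 for J=2.
        n_coeffs = 1 + 8 * J + 8**2 * J * (J - 1) // 2
        self.proj = nn.Linear(in_channels * n_coeffs, embed_dim)

    def forward(self, x):  # x: (B, C, H, W)
        s = self.scattering(x)                          # (B, C, K, H/2^J, W/2^J)
        B, C, K, h, w = s.shape
        s = s.reshape(B, C * K, h * w).transpose(1, 2)  # (B, h*w, C*K)
        return self.proj(s)                             # (B, h*w, embed_dim)

class SViTSketch(nn.Module):
    """Scattering tokenizer followed by a standard Transformer encoder.
    Positional embeddings are omitted here for brevity."""
    def __init__(self, num_classes=10, embed_dim=192, depth=4, heads=3):
        super().__init__()
        self.tokenizer = ScatteringTokenizer(embed_dim=embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.tokenizer(x)                          # (B, N, D)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)            # prepend [CLS]
        out = self.encoder(tokens)
        return self.head(out[:, 0])                         # classify on [CLS]

# Example usage: CIFAR-sized 32x32 RGB input.
model = SViTSketch()
logits = model(torch.randn(2, 3, 32, 32))  # -> shape (2, 10)
```

A property worth noting in this setup: because the wavelet filters are fixed, the tokenizer contributes no trainable parameters beyond the final linear projection, which is consistent with the abstract's suggestion that such tokenizations help most on small datasets, where learned patch embeddings are prone to overfitting.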