Keywords: Vision transformer, self-attention, spatial structure, sensitivity
Verify Author List: I have double-checked the author list and understand that additions and removals will not be allowed after the submission deadline.
Abstract: The self-attention operation, the core operation of the vision transformer (VT), is position-independent. VT therefore relies on positional embedding to encode spatial information. However, we find that the role of positional embedding is very limited and that VT is insensitive to spatial structure. We demonstrate a significant sensitivity gap between VT and convolutional neural networks (CNNs) under random block shuffling and masking, which indicates that VT does not learn the spatial structure of the target well and focuses too heavily on small-scale detail features.
We argue that self-attention should encode spatial information through position-dependent operations rather than relying on positional embedding. We replace the linear projections of self-attention with convolution operations and give each feature point a regular receptive field, which significantly increases VT's sensitivity to spatial structure without sacrificing performance.
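The following is a minimal sketch of the kind of modification the abstract describes, assuming a PyTorch-style implementation: the position-independent linear q/k/v projections of self-attention are swapped for 3x3 convolutions applied to the 2D feature map, so each feature point is projected from a regular local receptive field. Module and parameter names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ConvProjectionAttention(nn.Module):
    """Self-attention whose q/k/v projections are convolutions over the spatial map."""

    def __init__(self, dim, num_heads=8, kernel_size=3):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        padding = kernel_size // 2
        # Convolutional projections replace the usual position-independent nn.Linear q/k/v.
        self.q = nn.Conv2d(dim, dim, kernel_size, padding=padding)
        self.k = nn.Conv2d(dim, dim, kernel_size, padding=padding)
        self.v = nn.Conv2d(dim, dim, kernel_size, padding=padding)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):  # x: (B, C, H, W) spatial feature map
        b, c, h, w = x.shape

        def split_heads(t):  # (B, C, H, W) -> (B, heads, H*W, C_head)
            return t.flatten(2).reshape(b, self.num_heads, c // self.num_heads, h * w).transpose(-2, -1)

        q, k, v = split_heads(self.q(x)), split_heads(self.k(x)), split_heads(self.v(x))
        attn = (q @ k.transpose(-2, -1)) * self.scale      # (B, heads, H*W, H*W)
        out = attn.softmax(dim=-1) @ v                     # (B, heads, H*W, C_head)
        out = out.transpose(-2, -1).reshape(b, c, h, w)    # back to a spatial map
        return self.proj(out)


# Usage: tokens are kept as a 2D feature map so the convolutional projections
# can encode spatial structure without an explicit positional embedding.
x = torch.randn(2, 64, 14, 14)
y = ConvProjectionAttention(dim=64, num_heads=8)(x)
print(y.shape)  # torch.Size([2, 64, 14, 14])
```

This sketch keeps the global attention computation unchanged and only replaces the projections; how the regular receptive field is enforced per feature point (e.g., kernel size, local windowing) is an assumption and may differ from the paper's design.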
A Signed Permission To Publish Form In Pdf: pdf
Primary Area: Applications (bioinformatics, biomedical informatics, climate science, collaborative filtering, computer vision, healthcare, human activity recognition, information retrieval, natural language processing, social networks, etc.)
Paper Checklist Guidelines: I certify that all co-authors of this work have read and commit to adhering to the guidelines in Call for Papers.
Student Author: Yes
Submission Number: 213