Vision transformer: To discover the "four secrets" of image patches

Tao Zhou, Yuxia Niu, Huiling Lu, Caiyue Peng, Yujie Guo, Huiyu Zhou

Published: 2024, Last Modified: 06 Nov 2025. Inf. Fusion 2024. CC BY-SA 4.0

Abstract: Highlights
• To address “how to divide patches?”, 5 key techniques of the patch division mechanism are summarized: from single-size division to multi-size division, from fixed-number division to adaptive-number division, from non-overlapping division to overlapping division, from semantic segmentation division to semantic aggregation division, and from original image division to feature map division.
• To address “how to select tokens?”, 3 key techniques of the token selection mechanism are summarized: token selection based on scores, token selection based on merging, and token selection based on convolution and pooling.
• To address “how to add position encoding?”, 5 key techniques of the position encoding mechanism are summarized: absolute position encoding, relative position encoding, conditional position encoding, locally-enhanced position encoding, and zero-padding position encoding.
• To address “how to calculate attention?”, 18 attention mechanisms are summarized in chronological order.
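
Two of the ideas named in the highlights, non-overlapping versus overlapping patch division and score-based token selection, can be illustrated with a minimal pure-Python sketch. This is not the survey's code: the function names `divide_into_patches` and `select_tokens_by_score`, the toy 8x8 image, and the use of mean patch intensity as the score are all assumptions made for illustration, standing in for whatever importance score a given method actually uses.

```python
def divide_into_patches(image, patch, stride):
    """Split a 2D grid (list of rows) into square patches.

    stride == patch gives non-overlapping division;
    stride < patch gives overlapping division.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            rows = image[top:top + patch]
            patches.append([row[left:left + patch] for row in rows])
    return patches


def select_tokens_by_score(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:keep])
    return [tokens[i] for i in kept_indices]


# Toy 8x8 "image" whose pixel value is its row-major index (assumption).
image = [[r * 8 + c for c in range(8)] for r in range(8)]

non_overlapping = divide_into_patches(image, patch=4, stride=4)
overlapping = divide_into_patches(image, patch=4, stride=2)

# Hypothetical importance score: mean intensity of each patch.
scores = [sum(sum(row) for row in p) / 16 for p in non_overlapping]
kept = select_tokens_by_score(non_overlapping, scores, keep=2)
```

With a stride equal to the patch size, the 8x8 grid yields a 2x2 layout of patches; halving the stride yields a 3x3 layout, since the number of patches per axis is `(size - patch) // stride + 1`. Overlap therefore increases the token count, which is one reason token selection is often paired with it.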