Dynamic Grained Encoder for Vision Transformers

Lin Song; Songyang Zhang; Songtao Liu; Zeming Li; Xuming He; Hongbin Sun; Jian Sun; Nanning Zheng

Published: 09 Nov 2021, Last Modified: 26 May 2025, NeurIPS 2021 Poster, Readers: Everyone

Keywords: Computer Vision, Vision Transformer, Dynamic Network

TL;DR: This paper introduces a dynamic network mechanism into Vision Transformers to reduce the spatial redundancy of image features.

Abstract: Transformers, the de facto standard for language modeling, have recently been applied to vision tasks. This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images and save computational cost. Specifically, we propose a Dynamic Grained Encoder for vision transformers, which adaptively assigns a suitable number of queries to each spatial region. It thus achieves a fine-grained representation in discriminative regions while keeping high efficiency. In addition, the Dynamic Grained Encoder is compatible with most vision transformer frameworks. Without bells and whistles, our encoder allows state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification. Extensive experiments on object detection and segmentation further demonstrate the generalizability of our approach. Code is available at https://github.com/StevenGrove/vtpack.
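
The idea of assigning a different number of queries to each spatial region can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes a single-channel toy image, a fixed region size, candidate granularities of 1, 2, and 4, and a hand-written variance threshold standing in for whatever selection mechanism the paper uses. All function names and thresholds here are hypothetical.

```python
from statistics import pvariance


def pool_region(region, grain):
    """Average-pool a square region (list of rows) into grain x grain queries."""
    step = len(region) // grain
    queries = []
    for gy in range(grain):
        for gx in range(grain):
            cells = [region[y][x]
                     for y in range(gy * step, (gy + 1) * step)
                     for x in range(gx * step, (gx + 1) * step)]
            queries.append(sum(cells) / len(cells))
    return queries


def choose_grain(region, thresholds=(0.01, 0.1), grains=(1, 2, 4)):
    """Hypothetical heuristic: the higher the variance, the finer the grain.

    Assumption for illustration only; a variance threshold is swapped in
    for the paper's own selection mechanism.
    """
    var = pvariance([v for row in region for v in row])
    if var < thresholds[0]:
        return grains[0]
    if var < thresholds[1]:
        return grains[1]
    return grains[2]


def dynamic_grained_queries(image, region_size=4):
    """Return one (grain, queries) pair per region, in row-major region order."""
    out = []
    for y0 in range(0, len(image), region_size):
        for x0 in range(0, len(image[0]), region_size):
            region = [row[x0:x0 + region_size]
                      for row in image[y0:y0 + region_size]]
            grain = choose_grain(region)
            out.append((grain, pool_region(region, grain)))
    return out


def make_demo_image():
    """8x8 toy image: two flat regions, one checkerboard, one mild texture."""
    image = [[0.0] * 8 for _ in range(8)]
    for y in range(4):
        for x in range(4, 8):
            image[y][x] = float((x + y) % 2)      # strong checkerboard
    for y in range(4, 8):
        for x in range(4, 8):
            image[y][x] = 0.4 * ((x + y) % 2)     # weaker texture
    return image


if __name__ == "__main__":
    result = dynamic_grained_queries(make_demo_image())
    print([grain for grain, _ in result])
    print(sum(len(queries) for _, queries in result), "queries instead of 64")
```

On this toy input the flat regions collapse to a single query each, while the textured regions keep 4 or 16, so the encoder would attend over 22 queries rather than 64. That figure is a property of the toy image only and is unrelated to the 40%-60% reduction reported in the abstract.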

Supplementary Material: pdf

Code: https://github.com/StevenGrove/vtpack
