Abstract: Large vision transformers (ViTs) have achieved tremendous success in various computer vision tasks. Pre-trained on large datasets such as ImageNet-21K and JFT-300M, these ViT models provide robust low-level and high-level visual representations and repeatedly yield performance improvements on multiple downstream tasks. One straightforward way to inherit these robust representations is full fine-tuning. However, full fine-tuning is prone to overfitting small downstream datasets, since it adjusts the massive number of weights of the pre-trained large model. In addition, updating all parameters of a pre-trained large model requires substantial GPU memory and computation, which limits the application of these large models. To address these two drawbacks of full fine-tuning, in this paper we propose a parameter-efficient tuning (PET) method dubbed Important Channel Tuning (ICT). Unlike previous PET methods that adopt a trainable module to tune all the channels of a feature map, we hypothesize, and corroborate experimentally, that not all channels are equal for adaptation. Specifically, we design a tiny external module that determines the most informative channels in the feature map for effective adaptation. With only a simple linear layer applied to the important channels, ICT surpasses full fine-tuning on 18 out of 19 datasets of the VTAB-1K benchmark while adding only 0.11M parameters to ViT-B, i.e., 0.13% of the parameters updated by its full fine-tuning counterpart. Moreover, compared with previous PET methods, ICT achieves state-of-the-art average performance on the VTAB-1K benchmark with both ViT and Swin Transformer backbones.
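To make the idea of adapting only the important channels concrete, below is a minimal PyTorch sketch, not the authors' implementation: the module name `ImportantChannelTuning`, the learnable per-channel `importance` scores, the top-k selection, and the residual linear adapter are all assumptions for illustration, since the abstract does not specify how the informative channels are chosen or how the linear layer is inserted.

```python
# Illustrative sketch only; names and the channel-selection mechanism are assumed.
import torch
import torch.nn as nn


class ImportantChannelTuning(nn.Module):
    """Hypothetical ICT-style adapter: select the k most informative channels
    of a frozen feature map and adapt only those with a small linear layer."""

    def __init__(self, dim: int, k: int = 64):
        super().__init__()
        # Learnable per-channel importance scores (assumed selection mechanism).
        self.importance = nn.Parameter(torch.zeros(dim))
        # Lightweight linear layer applied only to the selected channels.
        self.adapter = nn.Linear(k, k)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) features from a frozen ViT block.
        idx = torch.topk(self.importance, self.k).indices        # important channels
        out = x.clone()
        out[..., idx] = x[..., idx] + self.adapter(x[..., idx])  # residual adaptation
        return out


# Usage: the backbone stays frozen; only the adapter's few parameters are trained.
feat = torch.randn(2, 197, 768)              # e.g., ViT-B token features
ict = ImportantChannelTuning(dim=768, k=64)
adapted = ict(feat)
print(adapted.shape)                          # torch.Size([2, 197, 768])
```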
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning