Abstract: While large transformer-based vision models have achieved remarkable performance on a variety of Computer Vision (CV) applications, they are cumbersome to fine-tune for target tasks. However, such models, pre-trained on large-scale datasets, are believed to exhibit a low intrinsic dimension during fine-tuning, which suggests that far fewer parameters need to be optimized to reach a similar level of performance. Parameter-Efficient Fine-Tuning (PEFT) methods leverage this property by restricting optimization to a subspace of the trainable parameters during fine-tuning. Using low-rank projection matrices as adapter modules has proven to be an effective approach; however, it has been explored predominantly for transformer-based language models and their linear layers, while there is growing interest in using convolutional layers jointly with linear layers in transformer-based vision models. In this work, we introduce adapters that bypass gradient updates to the pre-trained transformer blocks of vision models during fine-tuning. Our adapters are constructed from sequences of low-rank Kronecker products, which provide a factorized representation of large tensors and thus an efficient parameter space for fine-tuning. Since our adapter weights are merged with the original pre-trained model weights, the proposed PEFT method adds no extra computation or memory footprint at inference. Experimental results for image classification with ViT, the original transformer-based vision model, and with the convolution-enhanced transformer models CeiT and CvT substantiate the effectiveness of our proposed approach. Our method outperforms state-of-the-art PEFT methods.
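To make the general idea concrete, the following is a minimal sketch of an adapter whose weight update is expressed through low-rank Kronecker products and then merged into the frozen pre-trained weight, so inference incurs no extra cost. This is not the authors' implementation: the class and function names (`KroneckerAdapter`, `merge_adapter`), the factor shapes, the rank, and the number of Kronecker terms are illustrative assumptions.

```python
# Minimal sketch (assumed structure, not the paper's code) of a
# Kronecker-product adapter for a frozen pre-trained linear layer.
import torch
import torch.nn as nn


class KroneckerAdapter(nn.Module):
    """Parameterizes a weight update dW as a sum of Kronecker products
    A_i (a1 x a2) kron B_i (b1 x b2), where each B_i is itself low rank."""

    def __init__(self, out_features, in_features, a1=16, a2=16, rank=4, n_factors=2):
        super().__init__()
        assert out_features % a1 == 0 and in_features % a2 == 0
        b1, b2 = out_features // a1, in_features // a2
        self.A = nn.ParameterList(
            [nn.Parameter(torch.randn(a1, a2) * 0.01) for _ in range(n_factors)]
        )
        # Low-rank factors for each B_i: B_i = U_i @ V_i
        self.U = nn.ParameterList(
            [nn.Parameter(torch.randn(b1, rank) * 0.01) for _ in range(n_factors)]
        )
        self.V = nn.ParameterList(
            [nn.Parameter(torch.zeros(rank, b2)) for _ in range(n_factors)]
        )

    def delta_weight(self):
        # Sum of Kronecker products, shaped (out_features, in_features).
        dW = 0
        for A, U, V in zip(self.A, self.U, self.V):
            dW = dW + torch.kron(A, U @ V)
        return dW


@torch.no_grad()
def merge_adapter(linear: nn.Linear, adapter: KroneckerAdapter):
    """Fold the learned update into the pre-trained weight so the merged
    layer is used at inference with no added parameters or FLOPs."""
    linear.weight += adapter.delta_weight()


# Usage: freeze the pre-trained layer, train only the adapter, then merge.
layer = nn.Linear(768, 768)
layer.weight.requires_grad_(False)
adapter = KroneckerAdapter(768, 768)
# ... fine-tune adapter parameters on the target task ...
merge_adapter(layer, adapter)
```

Because only the small Kronecker factors are trained while the original weights stay frozen, the number of optimized parameters is a small fraction of the full layer, and merging removes the adapter entirely from the inference path.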