Parameter-efficient is not Sufficient: Exploring Parameter, Memory, and Time Efficient Adapter Tuning for Dense Predictions
Abstract: Pre-training and fine-tuning is a prevalent paradigm in computer vision (CV). Recently, parameter-efficient transfer learning (PETL) methods have shown promising performance in adapting to downstream tasks with only a few trainable parameters. Despite their success, existing PETL methods in CV can still be computationally expensive, consuming large amounts of memory and time during training, which keeps low-resource users from conducting research and applications on large models. In this work, we propose Parameter, Memory, and Time Efficient Visual Adapter ($\mathrm{E^3VA}$) tuning to address this issue. We provide a gradient backpropagation highway for low-rank adapters that eliminates the need for expensive backpropagation through the frozen pre-trained model, resulting in substantial savings of training memory and training time. Furthermore, we optimise the $\mathrm{E^3VA}$ structure for CV tasks to improve model performance. Extensive experiments on the COCO, ADE20K, and Pascal VOC benchmarks show that $\mathrm{E^3VA}$ saves up to 62.2% training memory and 26.2% training time on average, while achieving performance comparable to full fine-tuning and better than most PETL methods. Notably, we can even train a Swin-Large-based Cascade Mask R-CNN on GTX 1080Ti GPUs with less than 1.5% trainable parameters.
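To make the "gradient backpropagation highway" concrete, here is a minimal PyTorch sketch of the general idea rather than the authors' implementation: low-rank adapters form a parallel trainable path beside a frozen backbone, and keeping the backbone outside the autograd graph means the backward pass traverses only the adapter chain. All names here (`LowRankAdapter`, `HighwayBackbone`, `rank`, the toy MLP blocks) are illustrative assumptions, not the paper's exact structure.

```python
import torch
import torch.nn as nn


class LowRankAdapter(nn.Module):
    """Down-project, nonlinearity, up-project: very few trainable parameters."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank)
        self.act = nn.GELU()
        self.up = nn.Linear(rank, dim)
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op residual
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x)))


class HighwayBackbone(nn.Module):
    """Frozen blocks plus a parallel adapter path that carries all gradients."""

    def __init__(self, blocks: nn.ModuleList, dim: int, rank: int = 8):
        super().__init__()
        self.blocks = blocks
        for p in self.blocks.parameters():
            p.requires_grad_(False)  # the pre-trained backbone stays frozen
        self.adapters = nn.ModuleList(LowRankAdapter(dim, rank) for _ in blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x  # side-path ("highway") state; only it carries gradients
        for block, adapter in zip(self.blocks, self.adapters):
            with torch.no_grad():  # no activations cached inside the backbone
                x = block(x)
            # x is outside the autograd graph, so backward runs only
            # through the lightweight adapter chain, saving memory and time
            h = x + adapter(h)
        return h


# Illustrative usage, with toy MLP blocks standing in for transformer blocks
blocks = nn.ModuleList(
    nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(4)
)
model = HighwayBackbone(blocks, dim=64, rank=8)
out = model(torch.randn(2, 64))
out.sum().backward()  # gradients land only on adapter parameters
```

Because the frozen blocks run under `no_grad`, their activations are never stored for the backward pass, which is the source of the memory and time savings the abstract quotes; the adapters' small rank is what keeps the trainable-parameter count below a few percent.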
Relevance To Conference: The most pressing issue in the era of large models is the high cost of training and fine-tuning them. For most multimedia researchers, fine-tuning and integrating existing strong vision/language models is an economical way to build capable multimodal models. However, even fine-tuning a large model incurs substantial costs. This paper proposes an efficient visual fine-tuning paradigm that cuts training memory usage by about half while achieving performance comparable to full fine-tuning. The proposed method can effectively reduce the cost of fine-tuning multimodal large models in the future.
Supplementary Material: zip
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Experience] Multimedia Applications, [Content] Vision and Language
Submission Number: 977