Enhancing Fine-grained Multi-modal Alignment via Adapters: A Parameter-Efficient Training Framework for Referring Image Segmentation

Published: 18 Jun 2024, Last Modified: 06 Jul 2024 · WANT@ICML 2024 Poster · CC BY 4.0
Keywords: Parameter-Efficient Training; Referring Image Segmentation
TL;DR: We propose a parameter-efficient training method for referring image segmentation that surpasses previous state-of-the-art methods while updating only 0.9% to 1.8% of backbone parameters.
Abstract: In computer vision, Parameter-Efficient Training (PET) is increasingly replacing the traditional paradigm of pre-training followed by full fine-tuning. PET is particularly attractive for large-scale models because it reduces transfer-learning costs and makes better use of hardware. However, prevailing PET methods are designed primarily for single-modal optimization and lack fine-grained feature extraction designs; when applied to multi-modal dense prediction tasks, they typically fall short of full fine-tuning methods that consume far more resources. In this paper, we investigate efficient training for referring image segmentation. We introduce DenseCrossAdapter, a parameter-efficient module that enhances low-rank visual feature propagation by establishing dense interconnections between each layer and all preceding layers, enabling robust cross-modal feature interaction. We further employ text adapters to improve textual features. Our approach surpasses state-of-the-art methods on challenging benchmarks while updating only 0.9% to 1.8% of backbone parameters.
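To make the idea sketched in the abstract more concrete, below is a minimal PyTorch sketch of a densely connected low-rank cross-modal adapter together with a plain bottleneck adapter of the kind often used for text features. The class names (LowRankAdapter, DenseCrossAdapterSketch), the shared projection for preceding layers, and the cross-attention fusion are illustrative assumptions, not the paper's actual implementation; the sketch only shows how low-rank projections of the current and all preceding visual layers could be combined with text features.

```python
import torch
import torch.nn as nn


class LowRankAdapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add.
    A module of this general shape could serve as the text adapter mentioned above."""

    def __init__(self, dim: int, rank: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, rank)
        self.act = nn.GELU()
        self.up = nn.Linear(rank, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class DenseCrossAdapterSketch(nn.Module):
    """Illustrative sketch (an assumption, not the published design): low-rank
    projections of the current layer AND all preceding layers are concatenated
    (dense interconnection), fused with text features via cross-attention, and
    fed back through a residual update."""

    def __init__(self, vis_dim: int, txt_dim: int, rank: int = 32, num_heads: int = 4):
        super().__init__()
        self.curr_down = nn.Linear(vis_dim, rank)   # low-rank projection of the current layer
        self.prev_down = nn.Linear(vis_dim, rank)   # shared projection for all earlier layers
        self.txt_proj = nn.Linear(txt_dim, rank)    # bring text into the same low-rank space
        self.cross_attn = nn.MultiheadAttention(rank, num_heads, batch_first=True)
        self.up = nn.Linear(rank, vis_dim)

    def forward(self, vis_feats: list, txt_feat: torch.Tensor) -> torch.Tensor:
        # vis_feats: features from layers 1..L (current layer last), each (B, N, vis_dim)
        # txt_feat:  text features, (B, T, txt_dim)
        current = vis_feats[-1]
        # Dense interconnection: concatenate low-rank views of every preceding layer
        # with the current layer along the token dimension.
        dense = torch.cat(
            [self.prev_down(f) for f in vis_feats[:-1]] + [self.curr_down(current)],
            dim=1,
        )
        txt = self.txt_proj(txt_feat)
        # Cross-modal interaction: visual tokens attend to text tokens.
        fused, _ = self.cross_attn(query=dense, key=txt, value=txt)
        # Residual update uses only the tokens belonging to the current layer.
        return current + self.up(fused[:, -current.shape[1]:, :])


if __name__ == "__main__":
    # Toy usage: three preceding layers plus the current one.
    vis = [torch.randn(2, 196, 768) for _ in range(4)]
    txt = torch.randn(2, 20, 512)
    adapter = DenseCrossAdapterSketch(vis_dim=768, txt_dim=512)
    out = adapter(vis, txt)  # (2, 196, 768)
    print(out.shape)
    # In a full model the backbone would be frozen and only adapter parameters
    # trained, which is how a small trainable-parameter budget such as the
    # abstract's 0.9%-1.8% of backbone parameters would arise.
```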
Submission Number: 7