Abstract: Parameter-Efficient Fine-Tuning (PEFT) has been demonstrated to be effective and efficient for transferring foundation models to downstream tasks. Transferring pretrained uni-modal models to multi-modal downstream tasks helps alleviate the substantial computational cost of retraining multi-modal models. However, existing approaches primarily focus on multi-modal fusion while neglecting modal-specific fine-tuning, which is also crucial for multi-modal tasks. To this end, we propose parameter-efficient $Co$llaborative $P$rompt $L$earning ($CoPL$) to fine-tune both uni-modal and multi-modal features. Specifically, the collaborative prompts consist of modal-specific prompts and modal-interaction prompts. The modal-specific prompts are tailored to fine-tuning each modality, while the modal-interaction prompts are customized to explore inter-modality associations. Furthermore, prompt bank-based mutual coupling is introduced to extract instance-level features, further enhancing the model's generalization ability. Extensive experimental results demonstrate that our approach achieves comparable or higher performance on various audio-visual downstream tasks while adding only approximately 1% extra trainable parameters.
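To make the prompt layout concrete, the following is a minimal PyTorch sketch of how collaborative prompts could be organized: per-modality (modal-specific) prompts, shared modal-interaction prompts, and a prompt bank queried per instance. All module names, dimensions, and the attention-based bank lookup are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: a possible layout for collaborative prompts with a
# frozen backbone. Names, sizes, and the coupling mechanism are assumptions.
import torch
import torch.nn as nn

class CollaborativePrompts(nn.Module):
    def __init__(self, dim=768, n_specific=8, n_interact=8, bank_size=16):
        super().__init__()
        # Modal-specific prompts: one learnable set per modality (audio / visual).
        self.audio_prompts = nn.Parameter(torch.randn(n_specific, dim) * 0.02)
        self.visual_prompts = nn.Parameter(torch.randn(n_specific, dim) * 0.02)
        # Modal-interaction prompts: shared across modalities for cross-modal cues.
        self.interact_prompts = nn.Parameter(torch.randn(n_interact, dim) * 0.02)
        # Prompt bank: instance-level prompts selected by attention over pooled features.
        self.prompt_bank = nn.Parameter(torch.randn(bank_size, dim) * 0.02)

    def instance_prompts(self, tokens):
        # Attend from the pooled instance feature to the bank (a hypothetical
        # stand-in for the paper's prompt bank-based mutual coupling).
        query = tokens.mean(dim=1, keepdim=True)                           # (B, 1, D)
        attn = torch.softmax(query @ self.prompt_bank.T / tokens.size(-1) ** 0.5, dim=-1)
        return attn @ self.prompt_bank                                     # (B, 1, D)

    def forward(self, audio_tokens, visual_tokens):
        B = audio_tokens.size(0)
        expand = lambda p: p.unsqueeze(0).expand(B, -1, -1)
        audio_in = torch.cat([expand(self.audio_prompts),
                              expand(self.interact_prompts),
                              self.instance_prompts(audio_tokens),
                              audio_tokens], dim=1)
        visual_in = torch.cat([expand(self.visual_prompts),
                               expand(self.interact_prompts),
                               self.instance_prompts(visual_tokens),
                               visual_tokens], dim=1)
        return audio_in, visual_in  # fed to frozen encoders; only prompts are trained

# Usage: only the prompt parameters would be trainable; the backbone stays frozen.
prompts = CollaborativePrompts()
audio_tokens = torch.randn(2, 100, 768)   # (batch, tokens, dim) from a frozen audio encoder
visual_tokens = torch.randn(2, 196, 768)  # from a frozen visual encoder
audio_in, visual_in = prompts(audio_tokens, visual_tokens)
```

Under these assumptions, the trainable parameters are limited to the prompt tensors, which is consistent with the roughly 1% parameter overhead reported in the abstract.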
Relevance To Conference: Our study focuses on the audio-visual domain in the context of multi-modal deep learning. Multimedia video data contains rich information from both visual and audio modalities, and we believe that effectively fusing audio-visual data could greatly improve video content understanding.
Although large-scale foundation models have shown powerful generalization capabilities, the increasing size of multi-modal models creates a pressing need to reduce the substantial computational cost of fine-tuning them for downstream tasks. To address this challenge, we investigate transferring pretrained foundation models to audio-visual downstream tasks, which helps avoid retraining multi-modal models from scratch. Our method achieves competitive performance with only a few additional trainable parameters.
We believe that our work has the potential to advance video understanding and multi-modal learning.
Submission Number: 4292