CoPL: Parameter-Efficient Collaborative Prompt Learning for Audio-Visual Tasks

ACM MM 2024 Conference Submission 4292 Authors

Published: 20 Jul 2024, Last Modified: 06 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Parameter-Efficient Fine-Tuning (PEFT) has proven effective and efficient for transferring foundation models to downstream tasks. Transferring pretrained uni-modal models to multi-modal downstream tasks helps avoid the substantial computational cost of retraining multi-modal models. However, existing approaches focus primarily on multi-modal fusion while neglecting modal-specific fine-tuning, which is also crucial for multi-modal tasks. To this end, we propose parameter-efficient $Co$llaborative $P$rompt $L$earning ($CoPL$) to fine-tune both uni-modal and multi-modal features. Specifically, the collaborative prompts consist of modal-specific prompts and modal-interaction prompts. The modal-specific prompts are tailored for fine-tuning each modality, while the modal-interaction prompts are designed to capture inter-modality associations. Furthermore, prompt bank-based mutual coupling is introduced to extract instance-level features, further enhancing the model's generalization ability. Extensive experimental results demonstrate that our approach achieves comparable or higher performance on various audio-visual downstream tasks while using only approximately 1% additional trainable parameters.
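To make the prompt design concrete, the sketch below illustrates one plausible reading of the abstract: modal-specific and shared modal-interaction prompts are prepended to frozen audio and visual token sequences, and a prompt bank supplies instance-level prompts, with "mutual coupling" approximated by letting each modality's bank selection be driven by the other modality's pooled feature. All module names, shapes, and the top-k selection mechanism are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of collaborative prompt learning, assuming frozen uni-modal
# encoders that output token sequences of a shared hidden dimension.
# Everything below (names, sizes, selection rule) is a hypothetical reading
# of the abstract, not the paper's actual code.
import torch
import torch.nn as nn


class CollaborativePrompts(nn.Module):
    def __init__(self, dim=768, n_specific=4, n_interact=4, bank_size=16, top_k=2):
        super().__init__()
        # Modal-specific prompts: one learnable set per modality.
        self.audio_prompts = nn.Parameter(torch.randn(n_specific, dim) * 0.02)
        self.visual_prompts = nn.Parameter(torch.randn(n_specific, dim) * 0.02)
        # Modal-interaction prompts: shared across both modalities.
        self.interact_prompts = nn.Parameter(torch.randn(n_interact, dim) * 0.02)
        # Prompt bank for instance-level prompts (assumed similarity-based lookup).
        self.bank = nn.Parameter(torch.randn(bank_size, dim) * 0.02)
        self.top_k = top_k

    def _select_from_bank(self, query):
        # query: (B, dim) pooled feature from the *other* modality,
        # giving a simple form of cross-modal (mutual) coupling.
        scores = query @ self.bank.t()              # (B, bank_size)
        _, idx = scores.topk(self.top_k, dim=-1)    # (B, top_k)
        return self.bank[idx]                       # (B, top_k, dim)

    def forward(self, audio_tokens, visual_tokens):
        # audio_tokens: (B, Ta, dim), visual_tokens: (B, Tv, dim) from frozen encoders.
        B = audio_tokens.size(0)
        a_pool = audio_tokens.mean(dim=1)
        v_pool = visual_tokens.mean(dim=1)
        # Instance-level prompts selected with the opposite modality as the query.
        a_inst = self._select_from_bank(v_pool)
        v_inst = self._select_from_bank(a_pool)
        expand = lambda p: p.unsqueeze(0).expand(B, -1, -1)
        audio_in = torch.cat([expand(self.audio_prompts),
                              expand(self.interact_prompts), a_inst, audio_tokens], dim=1)
        visual_in = torch.cat([expand(self.visual_prompts),
                               expand(self.interact_prompts), v_inst, visual_tokens], dim=1)
        return audio_in, visual_in
```

Under this reading, only the prompt parameters (and any task head) would be trained while the backbone encoders stay frozen, which is consistent with the roughly 1% trainable-parameter budget reported in the abstract.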
Relevance To Conference: Our study addresses the audio-visual domain within multi-modal deep learning. Multimedia video data carries rich information from both the visual and audio modalities, and effectively fusing them can substantially improve video content understanding. Although large-scale foundation models show strong generalization, the growing size of multi-modal models makes it increasingly costly to fine-tune them for downstream tasks. To address this challenge, we investigate transferring pretrained foundation models to audio-visual downstream tasks, avoiding the expense of retraining multi-modal models from scratch. Our method achieves competitive performance with only a small number of additional trainable parameters. We believe this work can help advance video understanding and multi-modal learning.
Submission Number: 4292