Dynamic Adapter Merging for Continual Video Question-Answering Learning

15 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Continual Learning; Video QA; Multimodal
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We are the first to combine domain-specific adapter learning with model merging techniques for domain-incremental VidQA learning. The proposed DAM outperforms the state of the art by 9.1% on a benchmark spanning 6 different datasets.
Abstract: We present a parameter-efficient method for continual video question-answering (VidQA) learning. Our method, named DAM, uses $\textbf{D}$ynamic $\textbf{A}$dapter $\textbf{M}$erging to address the issues of (i) catastrophic forgetting, (ii) the costly retraining of large VidQA models on a continually shifting distribution of training data, and (iii) handling inputs from an unknown domain during test-time inference. Given a set of VidQA datasets, we sequentially train a domain-specific adapter for each dataset while freezing the parameters of a large pretrained video-language backbone. During inference, given a video-question sample from an unknown domain, our method first uses a non-parametric video-language router function to compute a probability for each domain-specific adapter, reflecting how relevant that adapter is to the current video-question input instance. Afterward, to exploit beneficial cross-domain cues and reduce the impact of potentially incorrect router predictions, we dynamically merge the parameters of the several highest-scoring adapters for the final VidQA prediction. Despite the simplicity of our approach, we demonstrate that it works well on continually streaming VidQA datasets across $6$ different domains. In particular, our model outperforms prior prompt-based continual learning approaches by 9.1% while exhibiting 1.9% less forgetting. The code and pretrained models will be publicly released.
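To make the inference procedure described in the abstract concrete, below is a minimal sketch of dynamic adapter merging with a non-parametric router, written in PyTorch. It is not the authors' released implementation: the names `domain_centroids` and `adapter_params`, the cosine-similarity router, the softmax temperature, and the top-k cutoff are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def merge_adapters(query_emb, domain_centroids, adapter_params, top_k=2, temperature=0.1):
    """Route a video-question embedding to domain-specific adapters and merge
    the parameters of the top-k highest-scoring adapters (illustrative sketch).

    query_emb: joint video-question embedding of the test sample, shape (dim,)
    domain_centroids: dict mapping domain name -> mean training embedding, shape (dim,)
    adapter_params: dict mapping domain name -> {param_name: tensor} for that adapter
    """
    domains = list(domain_centroids.keys())
    centroids = torch.stack([domain_centroids[d] for d in domains])  # (num_domains, dim)

    # Non-parametric router: cosine similarity between the query embedding and each
    # domain centroid, turned into per-adapter probabilities with a softmax.
    sims = F.cosine_similarity(query_emb.unsqueeze(0), centroids, dim=-1)  # (num_domains,)
    probs = F.softmax(sims / temperature, dim=-1)

    # Keep only the top-k highest-scoring adapters and renormalize their weights.
    topk_probs, topk_idx = probs.topk(min(top_k, len(domains)))
    topk_probs = topk_probs / topk_probs.sum()

    # Weighted average of adapter parameters (merging in parameter space).
    merged = {}
    for weight, idx in zip(topk_probs.tolist(), topk_idx.tolist()):
        for name, tensor in adapter_params[domains[idx]].items():
            merged[name] = merged.get(name, 0.0) + weight * tensor
    return merged
```

Because the merge happens in parameter space, the backbone runs a single forward pass with one merged adapter, so inference cost does not grow with the number of adapters that contribute.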
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 375