Towards Lifelong Video Understanding: A Survey on Continual Learning in Video Visual Question Answering
Abstract: This paper surveys the application of continual learning to Video Visual Question Answering (Video VQA) as a path toward lifelong video understanding. Despite rapid progress in VQA, models that achieve strong performance in static settings face significant challenges in real-world scenarios, most notably catastrophic forgetting when encountering new tasks or domains. We systematically review the fundamentals of Video VQA, including the evolution from image VQA to Video VQA, core architectures, and evaluation methods, and examine how continual learning techniques have been adapted to video understanding. We analyze strategies based on regularization, replay, parameter isolation, and hybrid methods, comparing their performance across different Video VQA task streams. We also discuss experimental evaluation frameworks, covering task division (by question type, domain, and video style), training protocols, and baseline selection (joint training, sequential fine-tuning, and independent training). Finally, we identify open challenges such as long-video understanding, modality imbalance, and computational efficiency, and outline future research directions and application scenarios. This survey aims to consolidate recent advances, highlight key trends, and guide the development of continual learning for Video VQA.
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: Continual learning, video VQA, VQA, survey
Contribution Types: Surveys
Languages Studied: English
Submission Number: 731