Keywords: Multimodal Video Understanding, In-Context Learning
Abstract: Despite the growing video understanding capabilities of recent Multimodal Large Language Models (MLLMs), existing video benchmarks primarily assess understanding based on a model's static, internal knowledge rather than its ability to learn and adapt from new contexts with minimal examples. To bridge this gap, we present Demo-driven Video In-Context Learning, a novel task in which models learn from in-context demonstrations to answer questions about target videos. Alongside this task, we propose Demo-ICL-Bench, a challenging benchmark specifically designed to evaluate demo-driven video in-context learning capabilities. Demo-ICL-Bench is constructed from 1200 instructional YouTube videos paired with questions, from which two types of demonstrations are derived: summaries of video subtitles serve as text demonstrations, and corresponding instructional videos serve as video demonstrations. To tackle this new challenge, we develop Demo-ICL, an MLLM trained with a two-stage strategy: video-supervised fine-tuning followed by information-assisted direct preference optimization, jointly enhancing the model's ability to learn from in-context examples. Extensive experiments with state-of-the-art MLLMs confirm the difficulty of Demo-ICL-Bench and demonstrate the effectiveness of Demo-ICL, unveiling directions for future research.
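The abstract describes a task format in which each query pairs in-context demonstrations (a subtitle summary as a text demonstration, or a related instructional video as a video demonstration) with a target video and question. Below is a minimal sketch of how such a sample might be assembled into a multimodal prompt; all field and function names here are hypothetical illustrations, not the paper's actual data schema or API.

```python
# Hypothetical sketch of the demo-driven video ICL sample format; the
# paper's real prompt template and field names may differ.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Demonstration:
    # Per the abstract, a demonstration is either a subtitle summary
    # (text demonstration) or an instructional video (video demonstration).
    text_summary: Optional[str] = None
    video_path: Optional[str] = None

@dataclass
class DemoICLSample:
    demos: List[Demonstration] = field(default_factory=list)  # in-context demonstrations
    target_video: str = ""                                    # path to the target video
    question: str = ""                                        # question about the target video

def build_prompt(sample: DemoICLSample) -> List[dict]:
    """Assemble a multimodal message list: demonstrations first, then the query."""
    messages = []
    for demo in sample.demos:
        if demo.text_summary is not None:
            messages.append({"type": "text", "content": demo.text_summary})
        else:
            messages.append({"type": "video", "content": demo.video_path})
    messages.append({"type": "video", "content": sample.target_video})
    messages.append({"type": "text", "content": sample.question})
    return messages
```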
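The second training stage is named as information-assisted direct preference optimization. The abstract does not specify how the "information-assisted" variant differs from standard DPO, so the sketch below shows only the standard DPO objective (Rafailov et al., 2023) as a reference point, with any auxiliary information signal omitted.

```python
# Standard DPO loss from sequence log-probabilities under the policy and a
# frozen reference model. The paper's "information-assisted" modification is
# not described in the abstract and is therefore not represented here.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Mean DPO loss over a batch of (chosen, rejected) response pairs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```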
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2071