Keywords: Multimodal Video Understanding, In-Context Learning
Abstract: Despite the growing video understanding capabilities of recent Multimodal Large Language Models (MLLMs), existing video benchmarks primarily assess understanding based on a model's static, internal knowledge rather than its ability to learn and adapt from new contexts with minimal examples. To bridge this gap, we present Demo-driven Video In-Context Learning, a novel task in which models learn from in-context demonstrations to answer questions about target videos. Alongside this task, we propose Demo-ICL-Bench, a challenging benchmark specifically designed to evaluate demo-driven video in-context learning capabilities. Demo-ICL-Bench is constructed from 1200 instructional YouTube videos paired with questions, from which two types of demonstrations are derived: summaries of video subtitles serve as text demonstrations, and corresponding instructional videos serve as video demonstrations. To tackle this new challenge, we develop Demo-ICL, an MLLM trained with a two-stage strategy: video-supervised fine-tuning followed by information-assisted direct preference optimization, jointly enhancing the model's ability to learn from in-context examples. Extensive experiments with state-of-the-art MLLMs confirm the difficulty of Demo-ICL-Bench and demonstrate the effectiveness of Demo-ICL, unveiling directions for future research.
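The abstract describes a task format in which each query pairs in-context demonstrations (a subtitle summary as a text demonstration, or a related instructional video as a video demonstration) with a target video and question. Below is a minimal sketch of how such a sample might be assembled into a multimodal prompt; all field and function names here are hypothetical illustrations, not the paper's actual data schema or API.

```python
# Hypothetical sketch of the demo-driven video ICL sample format; the
# paper's real prompt template and field names may differ.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Demonstration:
    # Per the abstract, a demonstration is either a subtitle summary
    # (text demonstration) or an instructional video (video demonstration).
    text_summary: Optional[str] = None
    video_path: Optional[str] = None

@dataclass
class DemoICLSample:
    demos: List[Demonstration] = field(default_factory=list)  # in-context demonstrations
    target_video: str = ""                                    # path to the target video
    question: str = ""                                        # question about the target video

def build_prompt(sample: DemoICLSample) -> List[dict]:
    """Assemble a multimodal message list: demonstrations first, then the query."""
    messages = []
    for demo in sample.demos:
        if demo.text_summary is not None:
            messages.append({"type": "text", "content": demo.text_summary})
        else:
            messages.append({"type": "video", "content": demo.video_path})
    messages.append({"type": "video", "content": sample.target_video})
    messages.append({"type": "text", "content": sample.question})
    return messages
```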
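The second training stage is named as information-assisted direct preference optimization. The abstract does not specify how the "information-assisted" variant differs from standard DPO, so the sketch below shows only the standard DPO objective (Rafailov et al., 2023) as a reference point, with any auxiliary information signal omitted.

```python
# Standard DPO loss from sequence log-probabilities under the policy and a
# frozen reference model. The paper's "information-assisted" modification is
# not described in the abstract and is therefore not represented here.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Mean DPO loss over a batch of (chosen, rejected) response pairs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```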
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2071