Keywords: Video Question Answering, Multimodal LLM, Modular Network, Self-reflected Training
Abstract: Recently, multimodal large language models (multimodal LLMs) have been applied to a wide range of video understanding tasks, particularly Video Question Answering (VideoQA). However, existing multimodal LLMs trained end-to-end for VideoQA are black boxes: they lack interpretability, as they can neither present a reasoning path nor indicate where in the video the answer is derived from. To tackle this challenge, we propose MSR-ViR (Modularized Self-Reflected Video Reasoner), a self-reflected framework that introduces a Modularized Spatial-Temporal Grounding (MoST-Grounding) module into multimodal LLMs for VideoQA tasks. MoST-Grounding uses a question parser LLM to generate execution policies, which serve as a reasoning path from question to answer and provide interpretability for our VideoQA framework. Following these policies, MoST-Grounding invokes small modules to localize temporal segments and spatial regions in the video, supplying the multimodal LLM with the most relevant visual information while presenting visual evidence for the final answer. To prevent the question parser LLM from generating unreasonable policies, we further propose a reinforcement learning-based Alternate Self-reflection training strategy that jointly optimizes the multimodal LLM and the question parser LLM. Experiments on VideoQA datasets (NExT-QA and STAR) and a grounded VideoQA dataset (NExT-GQA) demonstrate that our method significantly improves the video understanding capabilities of multimodal LLMs while providing interpretable reasoning paths together with temporal and spatial localization evidence within the video.
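To make the described pipeline concrete, below is a minimal sketch of the parse-then-ground-then-answer flow the abstract outlines. All function names, signatures, and module behaviors here are hypothetical placeholders for illustration only; they are not the authors' implementation, and the real system would call a question parser LLM, learned grounding modules, and a multimodal LLM in place of these stubs.

```python
# Illustrative sketch of a policy-driven grounding pipeline (hypothetical names).
from dataclasses import dataclass

@dataclass
class Policy:
    """An execution policy: an ordered list of (module_name, argument) steps.
    In the paper, this is emitted by the question parser LLM and serves as
    the interpretable reasoning path."""
    steps: list

def parse_question(question: str) -> Policy:
    # Stand-in for the question parser LLM; a real parser would tailor
    # the policy to the question.
    return Policy(steps=[("temporal_ground", question),
                         ("spatial_ground", question)])

def temporal_ground(frames: list, query: str) -> list:
    # Stand-in for a temporal grounding module: select the segment most
    # relevant to the query (here, a fixed middle slice).
    return frames[len(frames) // 4 : 3 * len(frames) // 4]

def spatial_ground(frames: list, query: str) -> list:
    # Stand-in for a spatial grounding module: crop relevant regions
    # per frame (here, identity).
    return frames

def answer(question: str, evidence: list) -> str:
    # Stand-in for the multimodal LLM conditioned on grounded evidence.
    return f"answer to {question!r} from {len(evidence)} grounded frames"

MODULES = {"temporal_ground": temporal_ground, "spatial_ground": spatial_ground}

def most_grounding(frames: list, question: str) -> str:
    policy = parse_question(question)       # interpretable reasoning path
    evidence = frames
    for module_name, arg in policy.steps:   # execute the policy step by step
        evidence = MODULES[module_name](evidence, arg)
    return answer(question, evidence)       # answer with visual evidence

if __name__ == "__main__":
    frames = [f"frame_{i}" for i in range(16)]
    print(most_grounding(frames, "What does the person pick up?"))
```

The design point the sketch illustrates is that the policy, not the multimodal LLM, decides which grounding modules run, so the intermediate segment and region selections remain inspectable as evidence for the final answer.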
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13177