Keywords: Video Question Answering, Multimodal LLM, Modular Network, Self-reflected Training
Abstract: Recently, multimodal large language models (multimodal LLMs) have been applied to a wide range of video understanding tasks, particularly Video Question Answering (VideoQA). However, existing multimodal LLMs trained end-to-end for VideoQA are black boxes: they lack interpretability, as they can neither present a reasoning path nor indicate where in the video the answer is derived from. To tackle this challenge, we propose MSR-ViR (Modularized Self-Reflected Video Reasoner), a self-reflected framework that introduces a Modularized Spatial-Temporal Grounding (MoST-Grounding) module into multimodal LLMs for VideoQA tasks. MoST-Grounding uses a question parser LLM to generate execution policies, which serve as a reasoning path from question to answer and provide interpretability for our VideoQA framework. Following these policies, MoST-Grounding invokes small modules to localize temporal segments and spatial regions in the video, supplying the multimodal LLM with the most relevant visual information while presenting visual evidence for the final answer. To prevent the question parser LLM from generating unreasonable policies, we further propose a reinforcement learning-based Alternate Self-reflection training strategy that jointly optimizes the multimodal LLM and the question parser LLM. Experiments on VideoQA datasets (NExT-QA and STAR) and a grounded VideoQA dataset (NExT-GQA) demonstrate that our method significantly improves the video understanding capabilities of multimodal LLMs while providing interpretable reasoning paths together with temporal and spatial localization evidence within the video.
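To make the described pipeline concrete, below is a minimal sketch of the parse-then-ground-then-answer flow the abstract outlines. All function names, signatures, and module behaviors here are hypothetical placeholders for illustration only; they are not the authors' implementation, and the real system would call a question parser LLM, learned grounding modules, and a multimodal LLM in place of these stubs.

```python
# Illustrative sketch of a policy-driven grounding pipeline (hypothetical names).
from dataclasses import dataclass

@dataclass
class Policy:
    """An execution policy: an ordered list of (module_name, argument) steps.
    In the paper, this is emitted by the question parser LLM and serves as
    the interpretable reasoning path."""
    steps: list

def parse_question(question: str) -> Policy:
    # Stand-in for the question parser LLM; a real parser would tailor
    # the policy to the question.
    return Policy(steps=[("temporal_ground", question),
                         ("spatial_ground", question)])

def temporal_ground(frames: list, query: str) -> list:
    # Stand-in for a temporal grounding module: select the segment most
    # relevant to the query (here, a fixed middle slice).
    return frames[len(frames) // 4 : 3 * len(frames) // 4]

def spatial_ground(frames: list, query: str) -> list:
    # Stand-in for a spatial grounding module: crop relevant regions
    # per frame (here, identity).
    return frames

def answer(question: str, evidence: list) -> str:
    # Stand-in for the multimodal LLM conditioned on grounded evidence.
    return f"answer to {question!r} from {len(evidence)} grounded frames"

MODULES = {"temporal_ground": temporal_ground, "spatial_ground": spatial_ground}

def most_grounding(frames: list, question: str) -> str:
    policy = parse_question(question)       # interpretable reasoning path
    evidence = frames
    for module_name, arg in policy.steps:   # execute the policy step by step
        evidence = MODULES[module_name](evidence, arg)
    return answer(question, evidence)       # answer with visual evidence

if __name__ == "__main__":
    frames = [f"frame_{i}" for i in range(16)]
    print(most_grounding(frames, "What does the person pick up?"))
```

The design point the sketch illustrates is that the policy, not the multimodal LLM, decides which grounding modules run, so the intermediate segment and region selections remain inspectable as evidence for the final answer.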
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13177