Modularized Self-Reflected Video Reasoner for Multimodal LLM with Application to Video Question Answering
Abstract: Multimodal Large Language Models (Multimodal LLMs) have shown their strength in Video Question Answering (VideoQA). However, due to the black-box nature of end-to-end training strategies, existing approaches based on Multimodal LLMs suffer from a lack of interpretability for VideoQA: they can neither present reasoning paths nor indicate which parts of the video their answers are derived from. To address this issue, we propose **MSR-ViR** (**M**odularized **S**elf-**R**eflected **Vi**deo **R**easoner), which for the first time integrates modular networks into Multimodal LLMs, providing VideoQA with explicit reasoning paths and thus greater interpretability. Specifically, we propose a **MoST-Grounding** (Modularized Spatial-Temporal Grounding) network that decomposes complex questions via tree-structured policies and localizes relevant temporal and spatial segments within videos through step-by-step reasoning. MoST-Grounding supplies Multimodal LLMs with explicit visually grounded information and clear reasoning paths, thereby enhancing the interpretability of the predicted answers. To further improve reasoning quality, we design an **Alternate Self-reflection Training Strategy** that jointly optimizes policy generation and the Multimodal LLM. Experiments on real-world datasets demonstrate the superiority of our MSR-ViR framework in video understanding, reasoning transparency, and providing explicit localization evidence for answers.
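The abstract describes a three-stage flow: a question is decomposed into a tree-structured policy, MoST-Grounding executes that policy to localize the relevant temporal segment and spatial region, and the grounded evidence is passed to a Multimodal LLM that produces the answer together with a visible reasoning path. The minimal Python sketch below illustrates that flow only; every class, function, and value in it (`PolicyNode`, `most_grounding`, `answer_with_mllm`, the example time span and box) is a hypothetical placeholder, not the paper's actual interface or implementation.

```python
"""Illustrative sketch of an MSR-ViR-style inference flow (hypothetical API)."""
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PolicyNode:
    """One step of a tree-structured policy (e.g. a temporal or spatial grounding call)."""
    module: str                                   # e.g. "temporal_ground" or "spatial_ground"
    argument: str                                 # natural-language argument parsed from the question
    children: List["PolicyNode"] = field(default_factory=list)


@dataclass
class GroundedEvidence:
    """Explicit localization evidence returned by the grounding stage."""
    time_span: Tuple[float, float]                # (start_sec, end_sec) of the selected segment
    region: Tuple[int, int, int, int]             # (x1, y1, x2, y2) box within the selected frames
    trace: List[str]                              # human-readable reasoning path, step by step


def generate_policy(question: str) -> PolicyNode:
    """Decompose the question into a tree-structured policy (stub for a policy generator)."""
    return PolicyNode(
        module="temporal_ground",
        argument="after picking up the cup",
        children=[PolicyNode(module="spatial_ground", argument="the person")],
    )


def most_grounding(video_path: str, policy: PolicyNode) -> GroundedEvidence:
    """Execute the policy tree step by step to localize relevant video content (stub)."""
    trace = [f"{policy.module}({policy.argument!r})"]
    trace += [f"{child.module}({child.argument!r})" for child in policy.children]
    return GroundedEvidence(time_span=(12.0, 18.5), region=(40, 60, 420, 380), trace=trace)


def answer_with_mllm(video_path: str, evidence: GroundedEvidence, question: str) -> str:
    """Feed the grounded segment plus the question to a multimodal LLM (stub)."""
    return "The person drinks from the cup."


def msr_vir_answer(video_path: str, question: str):
    policy = generate_policy(question)                      # 1. question -> tree-structured policy
    evidence = most_grounding(video_path, policy)           # 2. policy -> grounded evidence + reasoning path
    answer = answer_with_mllm(video_path, evidence, question)  # 3. grounded input -> answer
    return answer, evidence.trace, evidence.time_span


if __name__ == "__main__":
    ans, trace, span = msr_vir_answer(
        "kitchen.mp4", "What did the person do after picking up the cup?"
    )
    print(ans, trace, span)
```

In this reading, interpretability comes from returning the reasoning trace and localization alongside the answer rather than only the answer string; the alternate self-reflection training described in the abstract would then update the policy generator and the answering LLM in turn, which the stubs above do not model.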
Lay Summary: Today’s powerful Video Large Language Models can watch a video and answer questions about it—like “What did the person do after picking up the cup?”—but they often work like black boxes. We don’t know how they arrived at their answers or where in the video they got the information. To make this process more understandable, our paper introduces MSR-ViR, a new system that breaks the task into smaller, explainable steps. It uses a structured reasoning process to figure out which parts of the video are important and why. Think of it like a detective walking you through their thinking step by step. At the heart of this system is MoST-Grounding, a module that carefully selects the relevant moments and areas in the video based on the question. This information is then fed into a Large Language Model that answers the question—but now with a clear trail showing how the answer is reached. We also introduce a special training method that helps the system learn from its mistakes and improve its reasoning over time. Tests on real-world video datasets show that our method not only answers questions more accurately but also makes it easier to see and trust how those answers were formed.
Primary Area: Deep Learning->Large Language Models
Keywords: Video Question Answering, Multimodal LLM, Modular Network, Self-reflected Training
Submission Number: 6087