MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

Published: 01 Jan 2024, Last Modified: 15 Nov 2024 · CVPR 2024 · CC BY-SA 4.0
Abstract: This paper addresses the task of video question answering (videoQA) via a decomposed, multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content; however, through a simple and effective baseline, we find that such systems can be brittle in practice in challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage, in conjunction with an external memory. All stages are training-free and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extends to related tasks (grounded videoQA, paragraph captioning).
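The staged design described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: all function and variable names are hypothetical, and each stage is stubbed where the paper would invoke few-shot prompting of a large model. It shows only the control flow implied by the abstract, with three stages communicating through a shared external memory.

```python
# Hypothetical sketch of a MoReVQA-style multi-stage pipeline (assumed
# structure, not the released implementation). Each stage is training-free
# in the paper and realized via few-shot prompting of large models; here
# the stages are simple stubs so the pipeline is runnable end to end.

def event_parser(question: str, memory: dict) -> dict:
    # Stage 1: parse the question into events / sub-questions (stubbed
    # here as a naive split on the word "and").
    memory["events"] = [q.strip() for q in question.split("and")]
    return memory

def grounding(video_frames: list, memory: dict) -> dict:
    # Stage 2: ground each parsed event in visual content (stubbed as a
    # trivial frame lookup; the real stage would query a vision model).
    memory["grounded"] = {event: video_frames[:1] for event in memory["events"]}
    return memory

def reasoning(memory: dict) -> str:
    # Stage 3: reason over the grounded evidence to produce an answer.
    return f"answer based on {len(memory['grounded'])} grounded event(s)"

def morevqa_style_pipeline(question: str, video_frames: list) -> str:
    memory: dict = {}  # external memory shared across all stages
    memory = event_parser(question, memory)
    memory = grounding(video_frames, memory)
    return reasoning(memory)
```

Because each stage writes its intermediate output into the shared memory, the contents of `memory` after each call serve as the interpretable intermediate outputs the abstract refers to.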