TL;DR: We propose a Language-centric Tree Reasoning (LTR) framework that performs first-order reasoning over a structured logical tree, yielding traceable, language-driven inference and improving both the transparency and accuracy of reasoning in Video Question-Answering.
Abstract: Video Question-Answering (VideoQA) remains challenging for advanced cognitive reasoning due to the uncontrollable and opaque reasoning processes of existing Multimodal Large Language Models (MLLMs). To address this issue, we propose a novel Language-centric Tree Reasoning (LTR) framework that aims to enhance the reasoning ability of these models. In detail, it recursively divides the original question into logically manageable parts and conquers them piece by piece, improving both the reasoning capability and the interpretability of existing MLLMs. Specifically, in the first stage, LTR focuses on language alone to recursively generate a language-centric logical tree, which gradually breaks the complex cognitive question down into simple perceptual ones and plans the reasoning path through a RAG-based few-shot approach. In the second stage, with the aid of the video content, LTR performs bottom-up logical reasoning within the tree to derive the final answer along with a traceable reasoning path. Experiments across 11 VideoQA benchmarks demonstrate that our LTR framework significantly improves both accuracy and interpretability compared to state-of-the-art MLLMs. To our knowledge, this is the first work to use a language-centric logical tree to guide MLLM reasoning in VideoQA, paving the way for language-centric video understanding from perception to cognition.
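The two stages described in the abstract can be pictured as a recursive decomposition followed by bottom-up aggregation. The following is a minimal illustrative sketch only, not the authors' implementation: all names (`Node`, `decompose`, `answer`) are hypothetical stand-ins, the splitting rule is a toy heuristic, and the MLLM calls over video frames are replaced by placeholder strings.

```python
# Hypothetical sketch of the two-stage LTR idea; model calls are stubbed out.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of the language-centric logical tree."""
    question: str
    children: List["Node"] = field(default_factory=list)


def decompose(question: str, depth: int = 0) -> Node:
    """Stage 1 (stub): recursively split a cognitive question into
    simpler perceptual sub-questions, forming a logical tree.
    A real system would prompt an LLM with RAG-retrieved few-shot examples."""
    node = Node(question)
    if depth < 1 and " and " in question:  # toy split rule, for illustration only
        for part in question.split(" and "):
            node.children.append(decompose(part.strip(), depth + 1))
    return node


def answer(node: Node) -> str:
    """Stage 2 (stub): bottom-up reasoning over the tree. Leaves would be
    answered by an MLLM over the video; internal nodes combine child answers."""
    if not node.children:
        return f"answer({node.question})"   # placeholder for a perceptual MLLM query
    child_answers = [answer(c) for c in node.children]
    return " & ".join(child_answers)        # placeholder logical combination


tree = decompose("who enters the room and what do they pick up")
print(answer(tree))
# → answer(who enters the room) & answer(what do they pick up)
```

Because every intermediate node stores its sub-question and sub-answer, the full reasoning path stays inspectable, which is the interpretability benefit the paper emphasizes.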
Lay Summary: When humans are asked a complex question about a visual scene, instead of answering it directly, we naturally break it down into simpler questions, find answers to those simpler questions, and progressively reason toward the answer to the original complex question. However, today's advanced visual understanding models cannot carry out such language-centric structured reasoning. In this paper, we therefore introduce this reasoning ability into visual understanding models built on large language models by prompting them to emulate the human reasoning process described above, significantly improving the accuracy and interpretability of the best video understanding models.
Primary Area: Applications->Computer Vision
Keywords: video question-answering, multi-modal large language model
Submission Number: 3189