Struct-to-Reason: Enhancing Video Understanding of Vision-Language Models by Decoupling Perception and Reasoning via Structured Summary
Keywords: Visual Summarization, Reasoning Decoupling, Video Reasoning
TL;DR: We propose Struct-to-Reason, a two-phase framework that improves video understanding by first extracting a structured visual scratchpad and then using it for interpretable and robust reasoning across tasks.
Abstract: Humans perceive and comprehend visual scenes by forming internal mental structures that organize objects, events, and their relationships, and these structures support complex reasoning. Inspired by this cognitive process, we introduce Struct-to-Reason, a training-free, two-phase framework that enhances the video understanding of Vision-Language Models (VLMs) by explicitly decoupling perception from reasoning through a human-readable intermediate reasoning process. In the Perception Phase, the model is prompted to generate a Structured Summary, an object- and event-centric abstraction that externalizes its visual perception in a compositional and interpretable format. In the subsequent Reasoning Phase, this Structured Summary is reused to assist downstream tasks about the video, enabling more consistent and temporally grounded reasoning. The structured representation is task- and model-agnostic and can be reused across multiple downstream tasks and VLMs. Experiments on multiple video understanding benchmarks show that Struct-to-Reason consistently outperforms video-to-answer prompting, chain-of-thought reasoning, and free-form summarization, demonstrating its effectiveness in enabling precise and interpretable video understanding.
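Illustrative sketch (not the authors' implementation): the two-phase flow described in the abstract could be organized roughly as follows. Everything named here is an assumption made for illustration: the `VLM` callable interface, the prompt wording, the JSON schema with `objects` and `events` keys, and the choice to pass the frames again in the Reasoning Phase. The abstract specifies only that the summary is object- and event-centric and is reused across downstream tasks.

```python
import json
from typing import Callable, Sequence

# Hypothetical interface: a VLM is any callable that takes a text prompt
# plus a sequence of frame references and returns generated text.
VLM = Callable[[str, Sequence[str]], str]

# Assumed prompt and schema; the source does not specify either.
PERCEPTION_PROMPT = (
    "Describe the video as JSON with keys 'objects' (list of names) and "
    "'events' (list of {'time', 'description'})."
)


def perceive(vlm: VLM, frames: Sequence[str]) -> dict:
    """Perception Phase: ask the model once for a structured summary."""
    return json.loads(vlm(PERCEPTION_PROMPT, frames))


def reason(vlm: VLM, frames: Sequence[str], summary: dict, question: str) -> str:
    """Reasoning Phase: answer a task question with the summary in the prompt."""
    prompt = (
        "Structured summary:\n"
        + json.dumps(summary, indent=2)
        + "\n\nQuestion: "
        + question
    )
    return vlm(prompt, frames)


def stub_vlm(prompt: str, frames: Sequence[str]) -> str:
    """Stand-in model for illustration only; returns canned text."""
    if prompt == PERCEPTION_PROMPT:
        return json.dumps({
            "objects": ["person", "cup"],
            "events": [{"time": "0-3s", "description": "person picks up cup"}],
        })
    # Echo the last prompt line so the data flow is visible.
    return "stub answer for: " + prompt.splitlines()[-1]


frames = ["frame_000.jpg", "frame_001.jpg"]  # placeholder frame references
summary = perceive(stub_vlm, frames)         # computed once per video
answers = [
    reason(stub_vlm, frames, summary, q)     # reused for every question
    for q in ("What is picked up?", "Who acts first?")
]
```

In this sketch the summary is computed once per video and then reused for each question, which mirrors the reusability claim; a real system would replace `stub_vlm` with an actual model call and would need to handle model output that is not valid JSON.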
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 36