HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding

Published: 01 Jan 2025, Last Modified: 02 Oct 2025 · CVPR 2025 · CC BY-SA 4.0
Abstract: Despite advancements in multimodal large language models (MLLMs), current approaches struggle with medium-to-long video understanding due to frame and context length limitations. As a result, these models often depend on frame sampling, which risks missing key information over time and lacks task-specific relevance. To address these challenges, we introduce **HierarQ**, a task-aware hierarchical Q-Former based framework that processes frames sequentially, bypassing the need for frame sampling while avoiding the LLM's context length limitations. We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness into video understanding, with the entity stream capturing frame-level object information within a short context and the scene stream identifying their broader interactions over a longer period of time. Each stream is supported by dedicated memory banks, which enable our proposed **Hierar**chical **Q**uerying transformer (HierarQ) to effectively capture both short- and long-term context. Extensive evaluations on **10** video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ's state-of-the-art performance on most datasets, confirming its robustness and efficiency for comprehensive video analysis. All code will be made available upon acceptance.
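To make the described data flow concrete, below is a minimal, non-authoritative sketch of a two-stream, memory-backed hierarchical querying design under the assumptions stated in the abstract: frames are processed sequentially, each stream cross-attends visual tokens against the task prompt, a bounded memory bank accumulates per-stream context, and learned queries summarize everything into a fixed-size output. All module names (`StreamWithMemory`, `HierarchicalQFormerSketch`), dimensions, and the FIFO memory policy are illustrative assumptions, since the official code has not yet been released.

```python
# Minimal sketch of the two-stream, memory-augmented querying flow (assumed design).
import torch
import torch.nn as nn


class StreamWithMemory(nn.Module):
    """One stream (entity or scene): language-guided modulation of frame tokens,
    with results appended to a bounded FIFO memory bank (illustrative policy)."""

    def __init__(self, dim: int = 256, num_heads: int = 4, memory_size: int = 16):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.memory_size = memory_size
        self.memory: list[torch.Tensor] = []

    def forward(self, frame_tokens: torch.Tensor, prompt_tokens: torch.Tensor) -> torch.Tensor:
        # Task prompt queries attend over the current frame's visual tokens.
        modulated, _ = self.cross_attn(prompt_tokens, frame_tokens, frame_tokens)
        self.memory.append(modulated.detach())
        if len(self.memory) > self.memory_size:
            self.memory.pop(0)  # drop the oldest entry to bound the context
        # Return the current features together with the accumulated memory.
        return torch.cat(self.memory, dim=1)


class HierarchicalQFormerSketch(nn.Module):
    """Processes frames one at a time; the entity stream keeps a short memory,
    the scene stream a longer one, and learned queries read from both."""

    def __init__(self, dim: int = 256, num_queries: int = 32):
        super().__init__()
        self.entity_stream = StreamWithMemory(dim, memory_size=4)   # short-term context
        self.scene_stream = StreamWithMemory(dim, memory_size=32)   # long-term context
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.read_attn = nn.MultiheadAttention(dim, 4, batch_first=True)

    def forward(self, frames: torch.Tensor, prompt_tokens: torch.Tensor) -> torch.Tensor:
        # frames: (T, 1, N, dim) visual tokens, consumed sequentially (no frame sampling).
        context = None
        for frame_tokens in frames:
            entity_ctx = self.entity_stream(frame_tokens, prompt_tokens)
            scene_ctx = self.scene_stream(frame_tokens, prompt_tokens)
            context = torch.cat([entity_ctx, scene_ctx], dim=1)
        # Learned queries compress the accumulated context into a fixed-size
        # representation, independent of video length, suitable for an LLM.
        out, _ = self.read_attn(self.queries, context, context)
        return out  # (1, num_queries, dim)


if __name__ == "__main__":
    model = HierarchicalQFormerSketch()
    frames = torch.randn(8, 1, 49, 256)   # 8 frames, 49 visual tokens each (hypothetical sizes)
    prompt = torch.randn(1, 12, 256)      # task/question embedding tokens
    print(model(frames, prompt).shape)    # torch.Size([1, 32, 256])
```

The fixed-size query output is what keeps the LLM's context length constant regardless of how many frames are streamed, which is the property the abstract attributes to the hierarchical querying design.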