$\mathbf{T^2HTR}$: $\textbf{T}$est-$\textbf{t}$ime $\textbf{H}$ierarchical $\textbf{T}$emporal $\textbf{R}$etrieval for Long Video Understanding

02 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Long-video Understanding, Hierarchical Temporal Retrieval, Causality Evaluation, Video Pools, Closed-loop
Abstract: Foundational Multi-modal Large Language Models (MLLMs) have achieved rapid progress in handling complex tasks across diverse modalities. However, they still struggle to deliver satisfactory performance on Long-video Understanding (LVU) tasks involving thousands of frames. Existing optimization strategies can be broadly categorized into LVU-specific fine-tuning, built-in token compression, and training-free keyframe extraction, with the last being the most suitable for flexible deployment across various MLLMs. Unfortunately, current training-free approaches predominantly focus on query-frame relevance retrieval, overlooking other levels of visual information and the inherent heterogeneity of LVU tasks. In this work, we propose the $\textbf{T}$est-$\textbf{t}$ime $\textbf{H}$ierarchical $\textbf{T}$emporal $\textbf{R}$etrieval ($\mathbf{T^2HTR}$) framework, which employs a multi-stage pipeline, including dual scene segmentation, joint score calculation, sub-scene window modeling, and dynamic mask-based inference, to extract distinct keyframe sets from the perspectives of relevance, summarization, and causality. These keyframe sets are then blended at varying ratios to construct multiple video sampling pools. Guided by adaptive feedback from the model, $\mathbf{T^2HTR}$ dynamically routes each sample to its optimal video pool, enabling more precise, sample-grained LVU. Extensive experiments demonstrate the strong performance of our scheme across multiple challenging LVU benchmarks. For instance, integrating $\mathbf{T^2HTR}$ with Qwen-2.5-VL yields performance gains of 3.5\% to 13.1\% on LVB, VideoMME, and MLVU.
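The abstract describes, but does not implement, the blend-and-route step. The Python sketch below illustrates one plausible reading of it under stated assumptions: three keyframe sets (relevance, summarization, causality) are mixed at varying ratios into candidate video pools, and a model-feedback score selects the pool for each sample. Every name here (`build_video_pools`, `route_sample`, `score_fn`, the ratio triples, the frame budget) is a hypothetical placeholder, not the authors' actual code or API.

```python
# Minimal sketch of "blend keyframe sets into video pools, then route by
# model feedback", assuming frame indices pre-sorted by per-set score.
# All names are illustrative placeholders, not the paper's implementation.

def build_video_pools(rel_frames, sum_frames, cau_frames, ratios, budget=32):
    """Blend relevance/summarization/causality keyframes at varying ratios.

    Each keyframe set is a list of frame indices ranked by its own score;
    each ratio is a (w_rel, w_sum, w_cau) triple that sums to 1.
    """
    pools = []
    for w_rel, w_sum, w_cau in ratios:
        pool = (rel_frames[: round(w_rel * budget)]
                + sum_frames[: round(w_sum * budget)]
                + cau_frames[: round(w_cau * budget)])
        pools.append(sorted(set(pool)))  # dedupe, restore temporal order
    return pools


def route_sample(pools, score_fn):
    """Pick the pool that the model's feedback scores highest for this sample."""
    return max(pools, key=score_fn)


# Usage with dummy data: three ranked keyframe sets, three blending ratios.
rel = [10, 42, 97, 120, 300]
summ = [0, 150, 299, 450]
cau = [95, 96, 121, 122]
pools = build_video_pools(
    rel, summ, cau,
    ratios=[(1.0, 0.0, 0.0), (0.5, 0.3, 0.2), (0.3, 0.3, 0.4)],
    budget=8,
)
best = route_sample(pools, score_fn=lambda p: len(p))  # placeholder score
```

In the paper's closed-loop setting, `score_fn` would presumably query the MLLM itself rather than a heuristic; the placeholder above only marks where that feedback signal plugs in.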
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 897