VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

Published: 30 May 2024 · Last Modified: 15 Aug 2024 · OpenReview Archive Direct Upload · CC BY 4.0
Abstract: Video-language understanding has historically focused on short video clips, and existing models often struggle with the complexities of long-form video understanding. Recently, many long video-language understanding approaches have taken advantage of the reasoning capabilities of Large Language Models (LLMs) to perform long video question answering, transforming videos into densely sampled frame captions and asking LLMs to respond to text queries over the captions. However, the frames used for captioning are often redundant and contain irrelevant information, making dense sampling inefficient; this strategy also ignores the fact that video question answering requires varying levels of granularity: some video segments are highly relevant to the question (and hence need more fine-grained detail) while others are less relevant. Thus, these LLM-based approaches are prone to missing information and operate on large numbers of irrelevant captions, lowering both performance and efficiency. To address these shortcomings, we introduce VideoTree, a query-adaptive and hierarchical framework for long-video understanding with LLMs. Specifically, VideoTree dynamically extracts query-related information from the input video and builds a tree-based video representation for LLM reasoning. First, VideoTree adaptively selects frames for captioning by clustering frames based on their visual features and scoring clusters based on their relevance to the query; this process is iterated until enough query-related keyframes are extracted. Second, it organizes the visual clusters into a query-adaptive, hierarchical tree structure that encodes varying levels of granularity, with higher (deeper) resolution on relevant segments. Finally, VideoTree produces an answer to each question by traversing the tree's keyframes and passing their captions to an LLM, which answers the query. Our experiments show that our training-free, adaptive method improves both reasoning accuracy and efficiency compared to existing methods: VideoTree achieves 7.0%, 2.2%, and 2.7% accuracy improvements over existing methods on the popular EgoSchema, NExT-QA, and IntentQA benchmarks, respectively, while reducing inference time by 40%.
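To make the cluster-score-expand pipeline described in the abstract concrete, the following is a minimal, illustrative Python sketch, not the paper's implementation. The helper functions extract_features, caption_frame, score_relevance, and answer_with_llm are hypothetical stubs standing in for the actual visual encoder, captioner, LLM relevance scorer, and LLM answerer, and the cluster widths and relevance cutoff are assumed values rather than the paper's reported settings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stubs standing in for the real visual encoder, frame
# captioner, LLM relevance scorer, and LLM answerer (assumptions).
def extract_features(frames):            # e.g., per-frame visual features
    return np.asarray(frames, dtype=float).reshape(len(frames), -1)

def caption_frame(frame):                # e.g., an image captioning model
    return f"caption of frame {frame!r}"

def score_relevance(caption, query):     # e.g., LLM-rated relevance score
    return 1 + (hash((caption, query)) % 3)

def answer_with_llm(captions, query):    # e.g., LLM QA over keyframe captions
    return f"answer to {query!r} from {len(captions)} captions"


def build_video_tree(frames, query, widths=(4, 8, 16), relevance_cutoff=2):
    """Cluster frames, keep query-relevant clusters, and expand relevant
    clusters into finer sub-clusters (deeper, higher-resolution tree levels)."""
    feats = extract_features(frames)
    keyframes = []
    indices = np.arange(len(frames))

    for width in widths:                          # coarse -> fine tree levels
        k = min(width, len(indices))
        km = KMeans(n_clusters=k, n_init="auto").fit(feats[indices])
        relevant = []
        for c in range(k):
            members = indices[km.labels_ == c]
            # Caption the member frame closest to the cluster centroid.
            center = members[np.argmin(
                np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1))]
            if score_relevance(caption_frame(frames[center]), query) >= relevance_cutoff:
                keyframes.append(center)          # keep this keyframe
                relevant.extend(members)          # expand this branch deeper
        if not relevant:                          # nothing relevant: stop early
            break
        indices = np.array(relevant)

    keyframes = sorted(set(keyframes))            # traverse in temporal order
    return [caption_frame(frames[i]) for i in keyframes]


if __name__ == "__main__":
    video = list(range(64))                       # dummy stand-in "frames"
    query = "What is the person doing?"
    print(answer_with_llm(build_video_tree(video, query), query))
```

In this sketch, only clusters scored as relevant are re-clustered at the next level, so relevant segments receive more keyframes (finer granularity) while irrelevant segments are captioned at most once, mirroring the adaptive, query-dependent allocation the abstract describes.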