$\infty$-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation
Abstract: Current video-language models struggle with long-video understanding due to limited context lengths and reliance on sparse frame subsampling, which often leads to information loss. In this paper, we introduce $\infty$-Video, which can process arbitrarily long videos through a continuous-time long-term memory (LTM) consolidation mechanism. Our framework augments video Q-formers, enabling them to process unbounded video contexts efficiently and without additional training. Through continuous attention, our approach dynamically allocates higher granularity to the most relevant video segments, forming "sticky" memories that evolve over time. Experiments with Video-LLaMA and VideoChat2 demonstrate improved performance on video question-answering tasks, showcasing the potential of continuous-time LTM mechanisms to enable scalable, training-free comprehension of long videos.
Lay Summary: $\infty$-Video is a method for understanding long videos without any additional training. Traditional video-language models often struggle with lengthy content because of limited context windows and reliance on sparse frame sampling, which can lead to information loss. $\infty$-Video addresses this challenge with a continuous-time long-term memory (LTM) consolidation mechanism. This approach enhances existing video Q-formers, enabling them to process unbounded video contexts efficiently. By employing continuous attention, the model dynamically allocates higher granularity to the most relevant video segments, forming "sticky" memories that evolve over time. This mechanism allows the model to retain and focus on critical information throughout the video, akin to human memory consolidation. The effectiveness of $\infty$-Video is demonstrated through experiments with Video-LLaMA and VideoChat2, which show improved performance on video question-answering tasks, indicating the potential of continuous-time LTM mechanisms for scalable, training-free comprehension of long videos.
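Conceptually, the mechanism described above can be sketched in a few lines: frame features are compressed into a continuous-time signal via basis functions, a query attends over that signal with a probability density rather than discrete softmax weights, and "sticky" resampling places more memory points where past attention mass was high. The code below is a minimal illustrative sketch under assumed choices (Gaussian RBF basis, ridge-regression fit, a fixed Gaussian attention density, quadrature on a grid); it is not the released implementation, and all names and hyperparameters here are hypothetical.

```python
# Minimal sketch of continuous-time memory + continuous attention over video
# frame features. Illustrative only; basis choice, ridge penalty, and the
# mapping from query to (mu, sigma) are assumptions, not the authors' code.
import numpy as np

def rbf_basis(t, centers, width):
    """Evaluate N Gaussian RBF basis functions psi(t) at times t in [0, 1]."""
    return np.exp(-((t[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))  # (T, N)

def fit_continuous_memory(frame_feats, centers, width, ridge=1e-3):
    """Ridge-regress L frame features onto the basis: x(t) ~= B^T psi(t)."""
    L, _ = frame_feats.shape
    t = np.linspace(0.0, 1.0, L)
    Psi = rbf_basis(t, centers, width)                       # (L, N)
    A = Psi.T @ Psi + ridge * np.eye(len(centers))
    return np.linalg.solve(A, Psi.T @ frame_feats)           # (N, d)

def continuous_attention(B, centers, width, mu, sigma, grid=512):
    """Context c = E_{t ~ N(mu, sigma^2)}[x(t)], by numerical quadrature."""
    t = np.linspace(0.0, 1.0, grid)
    p = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    p /= p.sum()                                             # discretized density
    Psi = rbf_basis(t, centers, width)                       # (grid, N)
    return (p[:, None] * (Psi @ B)).sum(axis=0)              # (d,)

def sticky_sample(mu, sigma, num_points, grid=512, rng=None):
    """Resample memory positions from the attention density, so segments that
    received more attention keep finer granularity after consolidation."""
    rng = rng or np.random.default_rng(0)
    t = np.linspace(0.0, 1.0, grid)
    p = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    p /= p.sum()
    return np.sort(rng.choice(t, size=num_points, p=p))

if __name__ == "__main__":
    L, d, N = 256, 64, 32                        # frames, feature dim, basis size
    feats = np.random.randn(L, d)                # stand-in for frame features
    centers = np.linspace(0.0, 1.0, N)
    B = fit_continuous_memory(feats, centers, width=1.0 / N)
    c = continuous_attention(B, centers, 1.0 / N, mu=0.7, sigma=0.05)
    print(c.shape, sticky_sample(0.7, 0.05, num_points=8))
```

In this toy setting, the query's attention is a density over the video timeline rather than a weight per frame, so arbitrarily long inputs are summarized by a fixed-size coefficient matrix, and the sticky resampling step concentrates that fixed budget on the segments the model found most relevant.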
Link To Code: https://github.com/deep-spin/Infinite-Video/tree/main
Primary Area: Deep Learning->Attention Mechanisms
Keywords: Long context, Video understanding, LLM, Continuous attention, VLLM
Submission Number: 11959