Keywords: Long-term video-LLM; Long video understanding; Video scene detection; Fine-tuning.
Abstract: In recent years, the demand for effective long video understanding has surged, driven by the increasing volume of video content across various platforms. However, existing models, primarily designed for short video clips, struggle to capture the complex spatiotemporal dynamics inherent in longer videos. To address this challenge, we propose a novel scene-clipping long video LLM that dynamically segments videos based on scene distribution without pre-specifying the number of clips, ensuring semantic consistency. Our method extracts frame representations with a pre-trained image encoder, applies an entropy-based scene-clipping algorithm to segment the video into semantically consistent clips, and generates clip embeddings through the Video-Qformer while incorporating temporal position information. Our approach enables the LLM to comprehensively understand the spatiotemporal content of long videos, paving the way for enhanced applications in video summarization, question answering, and interactive video analysis. We train our proposed model on long video QA and caption datasets and demonstrate its effectiveness on zero-shot long video understanding benchmarks, where it outperforms state-of-the-art video-LLMs in absolute accuracy across most tasks.
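The abstract does not specify the entropy-based scene-clipping procedure, so the following is only a minimal, hypothetical sketch of one way such a segmenter could work: a clip boundary is proposed wherever the entropy of a frame's similarity distribution to its recent context is high (i.e., the frame is not strongly tied to any preceding frame). The function name `scene_clips`, the `window` and `threshold` hyperparameters, and the boundary criterion are all illustrative assumptions, not the paper's algorithm.

```python
# Hypothetical sketch of entropy-based scene clipping over frame embeddings.
# Not the paper's algorithm; it only illustrates segmenting a video into clips
# without pre-specifying the number of clips.
import numpy as np


def scene_clips(frame_embeddings: np.ndarray, window: int = 8, threshold: float = 0.9):
    """Split a (T, D) array of frame embeddings into (start, end) clip index pairs.

    A boundary is proposed where the normalized entropy of the softmax over
    cosine similarities between frame t and its preceding `window` frames
    exceeds `threshold`: within a scene the distribution is peaked (low
    entropy), while at a scene change the frame is roughly equally dissimilar
    to all context frames (near-uniform, high entropy). Both `window` and
    `threshold` are illustrative hyperparameters.
    """
    feats = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    T = feats.shape[0]
    boundaries = [0]
    for t in range(window, T):
        # Cosine similarities between frame t and the preceding window of frames.
        sims = feats[t - window:t] @ feats[t]
        probs = np.exp(sims) / np.exp(sims).sum()            # softmax over the window
        entropy = -(probs * np.log(probs + 1e-12)).sum() / np.log(window)
        # Require a minimum clip length to avoid degenerate one-frame clips.
        if entropy > threshold and t - boundaries[-1] >= window:
            boundaries.append(t)
    boundaries.append(T)
    return list(zip(boundaries[:-1], boundaries[1:]))
```

Under this sketch, each resulting (start, end) span would be the unit passed to the Video-Qformer to produce a clip embedding, with the clip's temporal position attached before feeding the sequence to the LLM.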
Submission Number: 29