Scene-Clipping Long Video For Better Understanding

20 Oct 2024 (modified: 05 Nov 2024) · THU 2024 Fall AML Submission · CC BY 4.0
Keywords: Long-term video-LLM; Long video understanding; Video scene detection; Fine-tuning.
Abstract: In recent years, the demand for effective long video understanding has surged, driven by the growing volume of video content across platforms. However, existing models, which are primarily designed for short video clips, struggle to capture the complex spatiotemporal dynamics inherent in longer videos. To address this challenge, we propose a novel scene-clipping long video LLM that dynamically segments videos according to their scene distribution, without pre-specifying the number of clips, thereby preserving semantic consistency. Our method extracts frame representations with a pre-trained image encoder, segments the video into semantically coherent clips using an entropy-based scene-clipping algorithm, and generates clip embeddings through the Video-Qformer while incorporating temporal position information. This design enables the LLM to comprehensively understand the spatiotemporal content of long videos, paving the way for enhanced applications in video summarization, question answering, and interactive video analysis. We will train this method on various benchmarks and compare it against different baselines to validate its effectiveness.
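The abstract describes dynamic segmentation without a pre-specified clip count. As a rough illustration of that idea, the sketch below cuts a video wherever consecutive frame features diverge sharply; since the paper's actual entropy-based criterion is not detailed in the abstract, a simple similarity-drop rule stands in for it, and the function name and threshold are illustrative, not the authors' implementation.

```python
import numpy as np

def scene_clip(frame_feats, sim_threshold=0.5):
    """Dynamically segment a video into clips without fixing the clip count.

    frame_feats: (T, D) array of per-frame features, e.g. from a
    pre-trained image encoder. A cut is placed wherever the cosine
    similarity between adjacent frames drops below sim_threshold
    (a stand-in for the paper's entropy-based criterion).
    Returns a list of (start, end) frame index ranges.
    """
    # Normalize so adjacent-frame dot products are cosine similarities.
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sims = np.sum(feats[:-1] * feats[1:], axis=1)
    # A sharp similarity drop marks a scene boundary.
    cuts = [t + 1 for t, s in enumerate(sims) if s < sim_threshold]
    bounds = [0] + cuts + [len(frame_feats)]
    return list(zip(bounds[:-1], bounds[1:]))
```

For example, ten synthetic frames whose features switch between two orthogonal directions halfway through yield two clips, `[(0, 5), (5, 10)]`, with no clip count specified in advance.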
Submission Number: 19