Abstract: Enabling large language models (LLMs) to process long videos effectively is crucial for developing advanced multimodal large language models (MLLMs). While current general-purpose MLLMs can handle short video clips, they struggle with longer videos, often failing to capture essential information in videos longer than one minute. This issue primarily stems from over-compression, where the encoded video representations are insufficient to represent the entire video. To address this, we introduce Long Video Chat (LVCHAT), a novel approach focused on long video understanding with MLLMs. In LVCHAT, we propose Frame-Scalable Encoding (FSE) to encode global video information, where the number of video embeddings is dynamically adjusted based on video duration, preventing over-compression. Moreover, we introduce Interleaved Frame Encoding (IFE), which interleaves multiple video embedding groups with positional embeddings shared across groups. Experimental results demonstrate that LVCHAT significantly outperforms baselines on long-video QA and captioning tasks. Code and data will be released upon publication.
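The abstract only sketches FSE and IFE at a high level; the following is a minimal illustrative sketch of how the two ideas could be realized, assuming frames are split into duration-dependent groups and that each group is encoded into a fixed token budget. All function names and constants (e.g., `encode_group`, `TOKENS_PER_GROUP`, `SECONDS_PER_GROUP`) are hypothetical and not taken from the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple

TOKENS_PER_GROUP = 32      # hypothetical per-group embedding budget
SECONDS_PER_GROUP = 60     # hypothetical duration covered by one group


def frame_scalable_encoding(
    frames: Sequence,
    duration_sec: float,
    encode_group: Callable[[Sequence], List],
) -> List[List]:
    """FSE sketch: scale the number of embedding groups with video duration
    instead of compressing the whole video into one fixed-size representation."""
    num_groups = max(1, math.ceil(duration_sec / SECONDS_PER_GROUP))
    group_size = math.ceil(len(frames) / num_groups)
    groups = [frames[i:i + group_size] for i in range(0, len(frames), group_size)]
    # encode_group stands in for the video encoder; each call is assumed to
    # return TOKENS_PER_GROUP embeddings for its chunk of frames.
    return [encode_group(g) for g in groups]


def interleave_with_shared_positions(groups: List[List]) -> Tuple[List, List[int]]:
    """IFE sketch: interleave tokens from the embedding groups while reusing
    the same positional ids across groups, so positions stay within a range
    the LLM has seen during training."""
    tokens, positions = [], []
    for pos in range(TOKENS_PER_GROUP):
        for group in groups:
            tokens.append(group[pos])
            positions.append(pos)  # positional id shared across groups
    return tokens, positions
```

This is a reading of the abstract's description, not the released implementation; in particular, how LVCHAT actually chunks frames and assigns positions may differ from the grouping shown here.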
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: long video understanding, large multimodal models, large language models
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 159