Abstract: Enabling large language models (LLMs) to process long videos effectively is crucial for developing advanced multimodal large language models (MLLMs). While current general-purpose MLLMs can handle short video clips, they struggle with longer videos, often failing to capture essential information in videos longer than one minute. This issue primarily stems from over-compression, where the encoded video representations are insufficient to represent the entire video. To address this, we introduce Long Video Chat (LVCHAT), a novel approach focused on long video understanding with MLLMs. In LVCHAT, we propose Frame-Scalable Encoding (FSE) to encode global video information, where the number of video embeddings is dynamically adjusted based on video duration, preventing over-compression. Moreover, we introduce Interleaved Frame Encoding (IFE), which interleaves multiple video embedding groups with positional embeddings shared across groups. Experimental results demonstrate that LVCHAT significantly outperforms baselines on long-video QA and captioning tasks. Code and data will be released upon publication.
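The abstract only sketches FSE and IFE at a high level; the following is a minimal illustrative sketch of how the two ideas could be realized, assuming frames are split into duration-dependent groups and that each group is encoded into a fixed token budget. All function names and constants (e.g., `encode_group`, `TOKENS_PER_GROUP`, `SECONDS_PER_GROUP`) are hypothetical and not taken from the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple

TOKENS_PER_GROUP = 32      # hypothetical per-group embedding budget
SECONDS_PER_GROUP = 60     # hypothetical duration covered by one group


def frame_scalable_encoding(
    frames: Sequence,
    duration_sec: float,
    encode_group: Callable[[Sequence], List],
) -> List[List]:
    """FSE sketch: scale the number of embedding groups with video duration
    instead of compressing the whole video into one fixed-size representation."""
    num_groups = max(1, math.ceil(duration_sec / SECONDS_PER_GROUP))
    group_size = math.ceil(len(frames) / num_groups)
    groups = [frames[i:i + group_size] for i in range(0, len(frames), group_size)]
    # encode_group stands in for the video encoder; each call is assumed to
    # return TOKENS_PER_GROUP embeddings for its chunk of frames.
    return [encode_group(g) for g in groups]


def interleave_with_shared_positions(groups: List[List]) -> Tuple[List, List[int]]:
    """IFE sketch: interleave tokens from the embedding groups while reusing
    the same positional ids across groups, so positions stay within a range
    the LLM has seen during training."""
    tokens, positions = [], []
    for pos in range(TOKENS_PER_GROUP):
        for group in groups:
            tokens.append(group[pos])
            positions.append(pos)  # positional id shared across groups
    return tokens, positions
```

This is a reading of the abstract's description, not the released implementation; in particular, how LVCHAT actually chunks frames and assigns positions may differ from the grouping shown here.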
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: long video understanding, large multimodal models, large language models
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 159