Towards Event-intensive Long Video Understanding

ACL ARR 2024 June Submission 905 Authors

13 Jun 2024 (modified: 07 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: With the rapid development of video Multimodal Large Language Models (MLLMs), a surge of evaluation datasets has been proposed to assess their video understanding capability. However, due to the lack of rich events in the videos, these datasets may suffer from a shortcut bias: the answers can often be deduced from a few frames without watching the entire video. To address this issue, we construct an event-oriented long video understanding benchmark, \emph{\textbf{Event-Bench}}, built upon existing datasets and human annotations. The benchmark includes six event-related tasks and a total of 2,190 test instances to comprehensively evaluate the ability to understand video events. Additionally, we propose \emph{\textbf{Video Instruction Merging (VIM)}}, a low-cost method that enhances video MLLMs with merged event-intensive video instructions, addressing the scarcity of human-annotated, event-intensive data. Extensive experiments show that the best-performing model, GPT-4o, achieves an overall accuracy of 53.33, significantly outperforming the best open-source model by 15.62. Leveraging an effective instruction synthesis method and model architecture, our VIM outperforms both state-of-the-art open-source video MLLMs and GPT-4V on Event-Bench. All the code, data, and models will be publicly available.
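
As a purely illustrative note on how the headline numbers above might be aggregated, the following minimal Python sketch computes per-task and overall accuracy over a multi-task benchmark such as Event-Bench. The predictions.json path and the task/prediction/answer field names are assumptions made for this sketch, not the paper's actual evaluation code.

# Hypothetical sketch: aggregate per-task and overall accuracy for a
# multi-task benchmark in the spirit of Event-Bench. The file layout and
# field names are assumptions, not the authors' actual format.
import json
from collections import defaultdict

def accuracy_report(results_path: str) -> dict:
    """Compute per-task and overall accuracy (in %) from a JSON list of
    records, each assumed to hold 'task', 'prediction', and 'answer'."""
    with open(results_path) as f:
        records = json.load(f)

    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        correct[r["task"]] += int(r["prediction"] == r["answer"])

    report = {task: 100.0 * correct[task] / total[task] for task in total}
    report["overall"] = 100.0 * sum(correct.values()) / sum(total.values())
    return report

if __name__ == "__main__":
    # Prints something like {"<task name>": ..., "overall": ...}
    print(accuracy_report("predictions.json"))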
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, video processing, multimodality
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 905