MVSBench: A Benchmark for Multi-modal Video Comprehension with Enriched Context

ACL ARR 2025 February Submission7587 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: In recent years, notable progress has been made in Multi-modal Large Language Models (MLLMs), accompanied by the development of various benchmarks assessing their comprehension abilities. However, most benchmarks focus on visual understanding and QA tasks and cannot evaluate performance in complex scenarios involving audio information and other additional context. To address this gap, we introduce the $\textit{\textbf{M}ulti-modal \textbf{V}ideo \textbf{S}tory generation \textbf{Bench}mark}$, referred to as $\textit{\textbf{MVSBench}}$, a benchmark designed to evaluate MLLMs' ability to generate narrative-style captions for long videos enriched with auxiliary information. We propose an automatic dataset construction pipeline that reduces manual annotation while ensuring fairness and reliability through filtering techniques and state-of-the-art models. Experiments indicate that current state-of-the-art MLLMs perform poorly under our evaluation metrics, highlighting significant limitations in generating narratives enriched with auxiliary information. To address these challenges, we propose a novel framework, $\textit{\textbf{M}ovie-to-\textbf{S}tory (M2S)}$, which outperforms other MLLMs by over 13\% on MVSBench.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: benchmarking, cross-modal content generation, cross-modal information extraction, speech and vision, automatic speech recognition, spoken language understanding, multimodality
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 7587