LVCap-Eval: Towards Holistic Long Video Caption Evaluation for Multimodal LLMs

13 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Caption, VLLMs, Benchmark
Abstract: Generating coherent and factually grounded captions for long-form videos is a critical yet underexplored challenge for multimodal large language models (MLLMs). Existing benchmarks, which predominantly feature short clips, are insufficient for evaluating a model's ability to capture narrative structure and fine-grained details over extended durations. To address this gap, we introduce LVCap-Eval, a benchmark for long-form video captioning. LVCap-Eval comprises 200 videos from six diverse domains, with durations ranging from 2 to 20 minutes, and features a dual-dimension evaluation protocol that assesses both scene-level narrative coherence and event-level factual accuracy. To facilitate model improvement, we also provide a pipeline for generating a training corpus, demonstrating that fine-tuning on as few as 7,000 samples yields substantial gains. Our evaluation of existing MLLMs on this benchmark reveals a significant performance disparity: while leading closed-source models (e.g., Gemini-2.5-Pro) perform robustly across video durations, their open-source counterparts degrade sharply as video length increases. Finally, our analysis of these model failures highlights potential directions for improving the long-video comprehension of MLLMs.
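To make the dual-dimension protocol concrete, the sketch below shows one plausible way to combine scene-level coherence with event-level factual accuracy. The abstract does not specify the scoring formulas, so everything here, including the keyword-overlap stand-in for an LLM judge, the order-consistency discount, and the equal-weight aggregation, is an illustrative assumption rather than the benchmark's actual implementation.

```python
"""Hypothetical sketch of a dual-dimension long-video caption score,
in the spirit of LVCap-Eval's protocol (not the paper's method)."""
from dataclasses import dataclass


@dataclass
class Reference:
    scenes: list[str]  # ordered, human-written scene summaries
    events: list[str]  # fine-grained factual event statements


def _supported(caption: str, statement: str, min_overlap: float = 0.5) -> bool:
    # Toy stand-in for an LLM judge: a statement counts as supported if
    # enough of its content words (length > 3) appear in the caption.
    words = {w for w in statement.lower().split() if len(w) > 3}
    if not words:
        return False
    return len(words & set(caption.lower().split())) / len(words) >= min_overlap


def event_accuracy(caption: str, ref: Reference) -> float:
    """Event-level factual accuracy: fraction of reference events supported."""
    if not ref.events:
        return 0.0
    return sum(_supported(caption, e) for e in ref.events) / len(ref.events)


def scene_coherence(caption: str, ref: Reference) -> float:
    """Scene-level coherence: scene coverage, discounted when covered
    scenes appear out of their reference order in the caption."""
    cap_words = caption.lower().split()
    positions = []  # first-match word index of each covered scene, in scene order
    for scene in ref.scenes:
        idx = next((i for i, w in enumerate(cap_words)
                    if len(w) > 3 and w in scene.lower()), None)
        if idx is not None:
            positions.append(idx)
    if not positions:
        return 0.0
    coverage = len(positions) / len(ref.scenes)
    if len(positions) == 1:
        return coverage
    ordered = sum(a <= b for a, b in zip(positions, positions[1:]))
    return coverage * (ordered / (len(positions) - 1))


def lvcap_score(caption: str, ref: Reference) -> float:
    # Equal-weight average of the two dimensions (an assumption).
    return 0.5 * scene_coherence(caption, ref) + 0.5 * event_accuracy(caption, ref)
```

In practice the `_supported` heuristic would be replaced by a stronger judge (e.g., an entailment model or LLM prompt), but the two-axis structure, coverage with an ordering penalty plus per-event verification, is the part this sketch is meant to illustrate.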
Primary Area: datasets and benchmarks
Submission Number: 4694