Keywords: Caption, VLLMs, Benchmark
Abstract: Generating coherent and factually grounded captions for long-form videos is a critical yet underexplored challenge for multimodal large language models (MLLMs).
Existing benchmarks, which predominantly feature short clips, are insufficient for evaluating a model's ability to capture narrative structure and fine-grained details over extended durations.
To address this gap, we introduce LVCap-Eval, a benchmark for long-form video captioning.
LVCap-Eval comprises 200 videos, 2 to 20 minutes in length and drawn from six diverse domains, and features a dual-dimension evaluation protocol that assesses both scene-level narrative coherence and event-level factual accuracy.
To facilitate model improvement, we also provide a pipeline for generating a training corpus, demonstrating that fine-tuning with as few as 7,000 samples yields substantial gains.
Our evaluation of existing MLLMs on this benchmark reveals a significant performance disparity:
while leading closed-source models (e.g., Gemini-2.5-Pro) perform robustly across various video durations, their open-source counterparts degrade sharply as video length increases.
Finally, our analysis of these model failures highlights potential directions for improving the long-video comprehension of MLLMs.
Primary Area: datasets and benchmarks
Submission Number: 4694