Keywords: MLLM, cinematography understanding, VLM
TL;DR: We present ShotBench, a new benchmark for evaluating VLMs' cinematography understanding, along with the ShotQA dataset and our ShotVL model, which achieves state-of-the-art performance over both strong open-source and proprietary baselines.
Abstract: Recent Vision-Language Models (VLMs) have shown strong performance in general-purpose visual understanding and reasoning, but their ability to comprehend the visual grammar of movie shots remains underexplored and insufficiently evaluated. To bridge this gap, we present \textbf{ShotBench}, a dedicated benchmark for assessing VLMs' understanding of cinematic language. ShotBench includes 3,049 still images and 500 video clips drawn from more than 200 films, with each sample annotated by trained annotators or curated from professional cinematography resources, resulting in 3,608 high-quality question-answer pairs. We conduct a comprehensive evaluation of over 20 state-of-the-art VLMs across eight core cinematography dimensions. Our analysis reveals clear limitations of current VLMs in fine-grained perception and cinematic reasoning. To improve VLMs' capability in cinematography understanding, we construct a large-scale multimodal dataset, ShotQA, which contains approximately 70k question-answer pairs derived from movie shots.
Building on ShotQA, we propose ShotVL, a VLM trained with a two-stage strategy that combines supervised fine-tuning and Group Relative Policy Optimization (GRPO). Experimental results demonstrate that our model achieves substantial improvements, surpassing the strongest open-source and proprietary models evaluated on ShotBench and establishing a new state of the art.
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Flagged For Ethics Review: true
Submission Number: 4382