Keywords: Filmmaking, Shot Language Understanding, VLMs
Paper Track: Extended Abstract (non-archival)
Abstract: Shot language understanding (SLU) is crucial for interpreting narrative, emotion, and aesthetic style in filmmaking. Although vision-language models (VLMs) demonstrate strong general capabilities, they struggle with the complex, multi-dimensional nature of cinematography, primarily due to a severe lack of high-quality training data. To bridge this gap, we introduce SLU-SUITE, a comprehensive dataset comprising 490K human-annotated image and video QA pairs covering 33 distinct tasks across six cinematic dimensions. Leveraging SLU-SUITE, we identify that the core bottleneck of VLMs for SLU is semantic alignment with expert-defined boundaries and terminology, rather than raw visual perception. Guided by this insight, we propose a balanced data scheduling and parameter-efficient training strategy to build a universal SLU model, UniShot-8B. Comprehensive evaluations across both in-domain and out-of-domain settings demonstrate the superiority of UniShot-8B.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 4