Keywords: Filmmaking, Shot Language Understanding, VLMs
Paper Track: Extended Abstract (non-archival)
Abstract: Shot language understanding (SLU) is crucial for interpreting narrative, emotion, and aesthetic style in filmmaking. Although vision-language models (VLMs) demonstrate strong general capabilities, they struggle with the complex, multi-dimensional nature of cinematography, primarily due to a severe lack of high-quality training data. To bridge this gap, we introduce SLU-SUITE, a comprehensive dataset comprising 490K human-annotated image and video QA pairs covering 33 distinct tasks across six cinematic dimensions. Leveraging SLU-SUITE, we identify that the core bottleneck of VLMs for SLU is semantic alignment with expert-defined boundaries and terminology, rather than raw visual perception. Guided by this insight, we propose a balanced data scheduling and parameter-efficient training strategy to build a universal SLU model, UniShot-8B. Comprehensive evaluations across both in-domain and out-of-domain settings demonstrate the superiority of UniShot-8B.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 4