Keywords: Large language models, cinematic plans
Abstract: Large language models (LLMs) have demonstrated strong capabilities in narrative understanding and generation, yet their outputs remain largely unstructured and difficult to integrate into downstream visual generation pipelines. While recent multimodal models attempt to map text directly to images or videos, such end-to-end approaches often lack interpretability, controllability, and compatibility with practical cinematic workflows. In this work, we introduce visual executability, a property of language model outputs that allows them to be executed deterministically by downstream visual systems.
We propose a schema-guided framework that converts narrative text into shot-level executable cinematic plans, explicitly encoding scene context, character presence, emotional states, and cinematographic parameters such as shot type, camera movement, composition, and perspective. Our approach constrains language model generation using a strict JSON schema, ensuring structural validity, atomicity, and completeness of each visual unit. This design bridges narrative semantics and visual planning without requiring access to visual data or retraining multimodal models.
Through qualitative and quantitative evaluations on narrative texts, we demonstrate that our method produces fine-grained, coherent, and structurally valid cinematic plans that are directly usable for storyboarding and automated image or video generation pipelines. Our results suggest that enforcing visual executability at the language level offers a promising and modular alternative to end-to-end multimodal generation for controllable visual content creation.
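To make the schema-guided constraint concrete, the sketch below shows how a shot-level plan might be validated against a strict JSON schema before being handed to a rendering pipeline. It is a minimal illustration under assumed names: the field names, the allowed enum values, and the use of Python's jsonschema library are hypothetical choices, not the paper's actual schema.

    # Minimal sketch of shot-level schema validation. All field names and
    # allowed values below are illustrative assumptions, not the paper's schema.
    import jsonschema

    SHOT_SCHEMA = {
        "type": "object",
        "properties": {
            "scene_context": {"type": "string"},
            "characters": {"type": "array", "items": {"type": "string"}},
            "emotion": {"type": "string"},
            "shot_type": {"enum": ["wide", "medium", "close-up"]},
            "camera_movement": {"enum": ["static", "pan", "tilt", "dolly"]},
            "composition": {"type": "string"},
            "perspective": {"enum": ["eye-level", "high-angle", "low-angle"]},
        },
        "required": ["scene_context", "characters", "shot_type",
                     "camera_movement", "perspective"],
        # Rejecting unknown keys keeps each shot atomic and self-contained.
        "additionalProperties": False,
    }

    # A candidate shot, as it might be parsed from the model's JSON output.
    shot = {
        "scene_context": "rain-soaked alley at night",
        "characters": ["Mara"],
        "emotion": "apprehension",
        "shot_type": "close-up",
        "camera_movement": "static",
        "composition": "rule of thirds, subject left",
        "perspective": "low-angle",
    }

    # Raises jsonschema.ValidationError if the plan violates the schema,
    # enforcing structural validity before any storyboarding or rendering.
    jsonschema.validate(instance=shot, schema=SHOT_SCHEMA)
    print("shot plan is structurally valid")

In this kind of design, each validated object is one atomic visual unit, so a narrative maps to an ordered list of such shots that a downstream image or video generator can consume deterministically.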
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation, language grounding, multimodality, video processing, multimodal applications
Contribution Types: NLP engineering experiment, Theory
Languages Studied: English
Submission Number: 5651