The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

ACL ARR 2026 January Submission 2279 Authors

02 Jan 2026 (modified: 20 Mar 2026)
Venue: ACL ARR 2026 January Submission
Readers: Everyone
License: CC BY 4.0
Keywords: Video Generation, Multi-agent Systems, Cinematic Script, Multimodal Learning
Abstract: Recent advances in video generation have enabled high-fidelity short clips from text prompts, but generating long-horizon, dialogue-driven cinematic sequences remains challenging. A key bottleneck is a "semantic gap" between sparse conversational intent and the fine-grained, executable cinematic plan required for shot design, camera control, and continuity. We propose an end-to-end, script-centric agentic framework for dialogue-to-cinematic video generation. Our framework first uses ScripterAgent to translate coarse dialogue into a structured, shot-level cinematic script. To support this step, we construct ScriptBench, a benchmark of 1,750 instances annotated via an expert-guided pipeline with multi-round verification. ScripterAgent is trained via SFT for structural competence, followed by RL for cinematic alignment. The generated scripts are then executed by DirectorAgent, which orchestrates state-of-the-art video models using shot-aware segmentation and a frame-anchored cross-scene continuous generation strategy to improve long-horizon coherence. Comprehensive evaluation with both CriticAgent and human experts shows that conditioning video models on our scripts consistently increases human-rated script faithfulness alongside character consistency and narrative coherence. Our results also reveal a practical trade-off in current video generators between visual spectacle and strict script adherence.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: cross-modal content generation, video processing, speech and vision
Languages Studied: English
Submission Number: 2279