Lights, Camera, Consistency: A Multistage Pipeline for Character-Stable AI Video Stories

ACL ARR 2026 January Submission2215 Authors

02 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, Readers: Everyone, License: CC BY 4.0

Keywords: text-to-video generation, multi-stage generation, long-form video synthesis, character consistency, visual anchoring, script-based generation, multimodal models, bias analysis

Abstract: Generating long, cohesive video stories with consistent characters remains a significant challenge for current text-to-video models. We introduce a method that approaches video generation the way a filmmaker would. Instead of creating a video in one step, our pipeline first uses a large language model to generate a detailed production script. This script guides a text-to-image model in creating consistent visuals for each character, which then serve as anchors for a video generation model that synthesizes each scene individually. Our baseline comparisons validate the necessity of this multi-stage decomposition: removing the visual anchoring mechanism causes a catastrophic drop in character consistency scores (from 7.99 to 0.55), confirming that visual priors are essential for identity preservation. Furthermore, we analyze cultural disparities in current models and find distinct biases in subject consistency and dynamic degree between Indian-themed and Western-themed generations.
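The three stages described in the abstract (script, character anchors, per-scene video) can be illustrated with a minimal orchestration sketch. This is not the authors' implementation: the `Script` and `Scene` types, the `run_pipeline` function, and the three stage callables are hypothetical names introduced here, and the stub stages return placeholder strings instead of calling any real language, image, or video model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Scene:
    description: str
    characters: List[str]  # names of characters appearing in this scene


@dataclass
class Script:
    characters: Dict[str, str]  # character name -> appearance description
    scenes: List[Scene]


def run_pipeline(
    story_prompt: str,
    write_script: Callable[[str], Script],
    render_anchor: Callable[[str, str], str],
    render_scene: Callable[[Scene, Dict[str, str]], str],
) -> List[str]:
    """Run the stages in order and return one clip handle per scene."""
    # Stage 1 (assumed interface): a language model turns the prompt into a script.
    script = write_script(story_prompt)

    # Stage 2: create ONE anchor image per character, reused across all scenes.
    anchors = {
        name: render_anchor(name, look)
        for name, look in script.characters.items()
    }

    # Stage 3: synthesize each scene separately, conditioned only on the
    # anchors of the characters that appear in it.
    clips = []
    for scene in script.scenes:
        scene_anchors = {name: anchors[name] for name in scene.characters}
        clips.append(render_scene(scene, scene_anchors))
    return clips


# Stub stages (hypothetical placeholders, no model calls).
def stub_write_script(prompt: str) -> Script:
    return Script(
        characters={"Asha": "red scarf, short hair"},
        scenes=[
            Scene("Asha walks through a market", ["Asha"]),
            Scene("Asha boards a train", ["Asha"]),
        ],
    )


def stub_render_anchor(name: str, look: str) -> str:
    return f"anchor:{name}"


def stub_render_scene(scene: Scene, anchors: Dict[str, str]) -> str:
    used = ",".join(sorted(anchors.values()))
    return f"clip[{scene.description}|{used}]"


clips = run_pipeline(
    "A short story about Asha",
    stub_write_script,
    stub_render_anchor,
    stub_render_scene,
)
```

In this sketch every scene that features a character receives the same anchor handle, which is the property the abstract credits for identity preservation. Passing an empty anchor dictionary to the scene renderer would correspond loosely to the no-anchoring baseline.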

Paper Type: Long

Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond

Research Area Keywords: cross-modal content generation, multimodality, video processing, LLM/AI agents, model bias/fairness evaluation, automatic evaluation, language/cultural bias analysis, coherence

Contribution Types: NLP engineering experiment

Languages Studied: English

Submission Number: 2215
