Keywords: Long-form video generation, Temporal consistency evaluation, Cross-modal alignment, Narrative coherence, Multimodal benchmarking
Abstract: Recent advances in multimodal generative models have enabled long-form video creation conditioned on text, audio, and narrative prompts. While these systems demonstrate impressive visual fidelity and short-term coherence, maintaining narrative and temporal consistency over extended durations remains a critical challenge. In long video generation, failures often manifest as character identity drift, event inconsistency, semantic misalignment between modalities, or breakdowns in causal structure.
In this extended abstract, we propose a structured evaluation framework for analyzing narrative and temporal consistency in multimodal video generation systems. Rather than introducing a new generative architecture, we focus on assessing reasoning and alignment quality across three complementary dimensions: (1) temporal continuity across scenes and events, (2) multimodal semantic alignment between video, audio, and textual narration, and (3) controllability under user editing and structured constraints.
We introduce lightweight consistency metrics based on entity tracking, cross-modal embedding alignment, and event-level coherence scoring. Additionally, we design controlled prompt interventions to evaluate how well models preserve narrative structure under partial edits and conditional guidance. Experiments will be conducted on publicly available long-video generation benchmarks and synthetic narrative templates to enable reproducible evaluation.
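To make the three metric families concrete, the following is a minimal sketch of how they might be computed. It is not the implementation used in this work. It assumes that per-scene embeddings for tracked entities, video segments, and narration have already been extracted by some encoder, and that events have been detected as unique string labels. The function names (`identity_drift`, `cross_modal_alignment`, `event_order_coherence`) and the specific scoring choices (cosine similarity to the first scene, pairwise ordering agreement) are illustrative assumptions.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identity_drift(entity_embeddings: Sequence[Vector]) -> float:
    """Entity tracking (hypothetical metric): 1 minus the mean cosine similarity
    between an entity's embedding in the first scene and in each later scene.
    0.0 means no measured drift; larger values mean more drift."""
    if len(entity_embeddings) < 2:
        return 0.0
    reference = entity_embeddings[0]
    sims = [cosine(reference, emb) for emb in entity_embeddings[1:]]
    return 1.0 - sum(sims) / len(sims)


def cross_modal_alignment(video_embs: Sequence[Vector],
                          text_embs: Sequence[Vector]) -> float:
    """Cross-modal alignment (hypothetical metric): mean cosine similarity between
    paired per-scene video and narration embeddings in a shared space."""
    if len(video_embs) != len(text_embs):
        raise ValueError("video and text embeddings must be paired per scene")
    if not video_embs:
        return 0.0
    sims = [cosine(v, t) for v, t in zip(video_embs, text_embs)]
    return sum(sims) / len(sims)


def event_order_coherence(expected: Sequence[str],
                          observed: Sequence[str]) -> float:
    """Event-level coherence (hypothetical metric): fraction of event pairs from
    the narrative template whose relative order is preserved in the generated
    video. Event labels are assumed unique; events missing from the output
    are ignored here."""
    position = {event: i for i, event in enumerate(observed)}
    present = [e for e in expected if e in position]
    pairs = 0
    concordant = 0
    for i in range(len(present)):
        for j in range(i + 1, len(present)):
            pairs += 1
            if position[present[i]] < position[present[j]]:
                concordant += 1
    return concordant / pairs if pairs else 0.0
```

Under these assumptions, a prompt intervention could be scored by computing each metric before and after an edit and comparing the values. Penalizing missing events, rather than ignoring them, is an equally plausible design choice.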
By moving beyond short-clip visual realism toward reasoning-aware evaluation, this work aims to provide practical tools for analyzing reliability, controllability, and human-AI co-creation potential in next-generation video foundation models.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 5