Keywords: Artificial Intelligence Generated Content, Large Language Model, Text-to-Video Generation, Prompt Refinement
Abstract: Recent years have witnessed rapid progress in diffusion models, which has significantly advanced Text-to-Video (T2V) generation. Compared to Text-to-Image (T2I) generation, T2V models face additional challenges, including temporal consistency, motion coherence, and adherence to physical constraints across frames. To address these challenges, we propose a novel framework, Complex-Scenario-Aware Prompt Refinement (CSAPR), to improve prompt quality for T2V generation. CSAPR consists of two stages: prompt refinement and prompt verification. In the prompt refinement stage, CSAPR classifies each user prompt into one of eight representative categories and applies a targeted rewriting strategy guided by predefined meta prompts. In the prompt verification stage, CSAPR aligns semantic atoms from the original prompt with decomposed chunks of the refined prompt, ensuring that the refined prompt faithfully preserves the intended semantics while avoiding inconsistencies. Extensive experiments on three benchmarks, VBench, EvalCrafter, and T2V-CompBench, demonstrate that CSAPR significantly improves alignment with user intent and video generation quality in complex scenarios, with gains of up to 1.40\% in average score.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7457