Storytelling Video Generation with Retrieval Augmentation and Character Consistency

Published: 01 Jan 2024, Last Modified: 25 Sept 2025, ECCV Workshops (5) 2024, CC BY-SA 4.0
Abstract: Despite the recent rapid advancements in text-to-video (T2V) generation, creating storytelling videos from text remains an important yet challenging and underexplored task. In this work, we introduce a storytelling video generation system that produces coherent videos with a storyline using only text prompts as input. In addition, the generated character keeps the same appearance across different clips, which clearly differentiates our work from other T2V models that suffer from varied character appearances. Our key novelties are twofold: a retrieval-augmented T2V generation system (RAG-T2V) and a cross-clip character consistency mechanism. RAG-T2V consists of two functional components: (i) motion structure retrieval, which searches for videos with the desired content and actions through query texts, and (ii) a structure-guided text-to-video generation model, which generates plot-aligned videos according to text prompts and motion structure guidance. The character consistency mechanism is designed as a time-aware textual inversion process and can be learned without video character data (i.e., using only character images). Experimental results validate the storytelling video generation quality, character consistency, and semantic alignment of our proposed system, showing significant advantages over various baselines.
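To make the motion structure retrieval step more concrete, the following is a minimal sketch of text-based video retrieval, not the authors' implementation. The abstract does not specify the retrieval model, so everything here is an assumption made for illustration: the toy bag-of-words `embed` function stands in for a learned text encoder, and the `LIBRARY` of captioned clips, the clip ids, and the `retrieve` helper are hypothetical names.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a learned text encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, library, top_k=1):
    """Rank captioned clips by similarity to the query text and keep the top_k."""
    q = embed(query)
    ranked = sorted(
        library,
        key=lambda item: cosine(q, embed(item["caption"])),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical video library: each entry pairs a clip id with a caption.
LIBRARY = [
    {"id": "clip_001", "caption": "a person walking along a beach"},
    {"id": "clip_002", "caption": "a dog running in a park"},
    {"id": "clip_003", "caption": "a person cooking in a kitchen"},
]

# The query describes the desired action; the subject can differ from the
# retrieved clip, since only its motion structure would be reused as guidance.
best = retrieve("a bear walking along a beach", LIBRARY)[0]
print(best["id"])
```

In a pipeline of the kind the abstract describes, the retrieved clip would then supply motion structure guidance to the structure-guided T2V model, while the text prompt determines the appearance. How that guidance is extracted is not stated in the abstract and is left out of this sketch.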