Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation

Published: 01 Jul 2024, Last Modified: 24 Jul 2024
CVG Poster
License: CC BY 4.0
Keywords: Zero-Shot Text-to-Video Generation
Abstract: In the paradigm of AI-generated content (AIGC), increasing attention has been paid to transferring knowledge from pre-trained text-to-image (T2I) models to text-to-video (T2V) generation. Despite their effectiveness, these frameworks struggle to maintain a consistent narrative and to handle shifts in scene composition or object placement from a single abstract user prompt. Exploring the ability of large language models (LLMs) to generate time-dependent, frame-by-frame prompts, this paper introduces a new framework, dubbed DirecT2V. DirecT2V leverages instruction-tuned LLMs as directors, enabling the inclusion of time-varying content and facilitating consistent video generation. To maintain temporal consistency and prevent values from being mapped to different objects across frames, we equip the diffusion model with a novel value mapping method and dual-softmax filtering.
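The abstract describes the LLM acting as a frame-level director: a single abstract user prompt is expanded into one prompt per frame. Below is a minimal sketch of that idea, assuming a hypothetical `call_llm(system, user) -> str` helper wrapping any instruction-tuned LLM API; the prompt template and helper are illustrative assumptions, not the paper's actual implementation.

```python
def direct_frames(user_prompt: str, num_frames: int, call_llm) -> list[str]:
    """Expand one abstract user prompt into per-frame prompts (sketch)."""
    # System instruction casting the LLM as a "director" (wording is assumed).
    system = (
        "You are a video director. Given a scene description, write one "
        f"concise prompt per frame for {num_frames} frames, one per line. "
        "Keep characters and setting consistent while describing the "
        "time-varying motion in each frame."
    )
    reply = call_llm(system, user_prompt)
    # One prompt per line; drop blank lines and truncate to the frame count.
    frame_prompts = [line.strip() for line in reply.splitlines() if line.strip()]
    return frame_prompts[:num_frames]
```

Each frame-level prompt would then condition one denoising pass of the frozen T2I diffusion model, which is what makes the setup zero-shot: no T2V training is required.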
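The abstract names dual-softmax filtering without detailing it. The sketch below shows the standard dual-softmax mutual-matching operation from the feature-matching literature, which such filtering presumably builds on: confidence is high only where a query and a key select each other. The tensor shapes, temperature `tau`, and threshold `thresh` are illustrative assumptions; the paper's exact formulation for value mapping may differ.

```python
import torch

def dual_softmax_filter(sim: torch.Tensor, tau: float = 0.1, thresh: float = 0.05):
    """sim: (N, M) similarity between current-frame queries and anchor-frame keys."""
    p_row = torch.softmax(sim / tau, dim=1)  # each query distributes over anchor keys
    p_col = torch.softmax(sim / tau, dim=0)  # each key distributes over current queries
    conf = p_row * p_col                     # large only for mutual (two-way) matches
    match = conf.argmax(dim=1)               # best anchor key per query
    keep = conf.max(dim=1).values > thresh   # filter out low-confidence mappings
    return match, keep
```

Filtering out low-confidence matches is what would prevent a value from being carried over to a different object in a later frame.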
Submission Number: 5