Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation

Published: 01 Jul 2024, Last Modified: 24 Jul 2024
CVG Poster
License: CC BY 4.0
Keywords: Zero-Shot Text-to-Video Generation
Abstract: In the paradigm of AI-generated content (AIGC), increasing attention has been paid to transferring knowledge from pre-trained text-to-image (T2I) models to text-to-video (T2V) generation. Despite their effectiveness, these frameworks struggle to maintain a consistent narrative and to handle shifts in scene composition or object placement from a single abstract user prompt. Exploring the ability of large language models (LLMs) to generate time-dependent, frame-by-frame prompts, this paper introduces a new framework, dubbed DirecT2V. DirecT2V leverages instruction-tuned LLMs as directors, enabling the inclusion of time-varying content and facilitating consistent video generation. To maintain temporal consistency and prevent values from being mapped to different objects across frames, we equip the diffusion model with a novel value mapping method and dual-softmax filtering.
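The abstract describes the LLM acting as a frame-level director: a single abstract user prompt is expanded into one prompt per frame. Below is a minimal sketch of that idea, assuming a hypothetical `call_llm(system, user) -> str` helper wrapping any instruction-tuned LLM API; the prompt template and helper are illustrative assumptions, not the paper's actual implementation.

```python
def direct_frames(user_prompt: str, num_frames: int, call_llm) -> list[str]:
    """Expand one abstract user prompt into per-frame prompts (sketch)."""
    # System instruction casting the LLM as a "director" (wording is assumed).
    system = (
        "You are a video director. Given a scene description, write one "
        f"concise prompt per frame for {num_frames} frames, one per line. "
        "Keep characters and setting consistent while describing the "
        "time-varying motion in each frame."
    )
    reply = call_llm(system, user_prompt)
    # One prompt per line; drop blank lines and truncate to the frame count.
    frame_prompts = [line.strip() for line in reply.splitlines() if line.strip()]
    return frame_prompts[:num_frames]
```

Each frame-level prompt would then condition one denoising pass of the frozen T2I diffusion model, which is what makes the setup zero-shot: no T2V training is required.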
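The abstract names dual-softmax filtering without detailing it. The sketch below shows the standard dual-softmax mutual-matching operation from the feature-matching literature, which such filtering presumably builds on: confidence is high only where a query and a key select each other. The tensor shapes, temperature `tau`, and threshold `thresh` are illustrative assumptions; the paper's exact formulation for value mapping may differ.

```python
import torch

def dual_softmax_filter(sim: torch.Tensor, tau: float = 0.1, thresh: float = 0.05):
    """sim: (N, M) similarity between current-frame queries and anchor-frame keys."""
    p_row = torch.softmax(sim / tau, dim=1)  # each query distributes over anchor keys
    p_col = torch.softmax(sim / tau, dim=0)  # each key distributes over current queries
    conf = p_row * p_col                     # large only for mutual (two-way) matches
    match = conf.argmax(dim=1)               # best anchor key per query
    keep = conf.max(dim=1).values > thresh   # filter out low-confidence mappings
    return match, keep
```

Filtering out low-confidence matches is what would prevent a value from being carried over to a different object in a later frame.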
Submission Number: 5