Keywords: Audio to Video Generation, Keyframe Generation, Video Generation
Abstract: Generating video from various conditions, such as text, image, and audio, enables precise spatial and temporal control, leading to high-quality generation results. Most existing audio-to-visual animation models rely on uniformly sampled frames from video clips. Such a uniform sampling strategy often fails to capture key audio-visual moments in videos with dramatic motions, causing unsmooth motion transitions and audio-visual misalignment. To address these limitations, we introduce KeyVID, a keyframe-aware audio-to-visual animation framework that adaptively prioritizes the generation of keyframes in audio signals to improve the generation quality. Guided by the input audio signals, KeyVID first localizes and generates the corresponding visual keyframes that contain highly dynamic motions. The remaining frames are then synthesized using a motion interpolation module, effectively reconstructing the full video sequence. This design enables the generation of high frame-rate videos that faithfully align with audio dynamics, while avoiding the cost of directly training with all frames at a high frame rate. Through extensive experiments, we demonstrate that KeyVID significantly improves audio-video synchronization and video quality across multiple datasets, particularly for highly dynamic motions
Supplementary Material: zip
Primary Area: generative models
Submission Number: 3854
Loading