Abstract: In this paper, we introduce a new and challenging task, Zero-shot Controllable Image-to-Video Animation, whose goal is to animate an image along motion trajectories defined by the user, without fine-tuning the base model. The primary challenges are maintaining background consistency, consistency of the object in motion, faithfulness to the user-defined trajectory, and quality of the motion animation. We also introduce a novel diffusion-based approach for this task, IMG2VIDANIM-ZERO (IVA0). IVA0 tackles the controllable Image-to-Video (I2V) task by decomposing it into two subtasks: ‘out-of-place’ and ‘in-place’ motion animation. Owing to this decomposition, IVA0 can leverage existing layout-conditioned image generation methods for out-of-place motion and existing text-conditioned video generation methods for in-place motion animation, thus enabling zero-shot generation. Our model also addresses key challenges of controllable animation, using Layout Conditioning via Spatio-Temporal Masking to incorporate user guidance and a Motion Afterimage Suppression (MAS) scheme to reduce object ghosting during out-of-place animation. Finally, we design a novel controllable I2V benchmark featuring diverse local- and global-level metrics. Results establish IVA0 as the state of the art, setting a standard for the zero-shot controllable I2V task. Our method highlights the simplicity and effectiveness of task decomposition and modularization for this novel task, informing future studies.
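To make the layout-conditioning idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of how a spatio-temporal mask built from a user-defined trajectory could blend object latents over background latents at a denoising step; the function names, tensor shapes, and trajectory format are assumptions for illustration only.

```python
# Hypothetical sketch: spatio-temporal masking for out-of-place motion.
# All names, shapes, and the trajectory format are illustrative assumptions,
# not the paper's actual implementation.
import numpy as np

def make_spatiotemporal_mask(num_frames, height, width, trajectory, box_hw):
    """Build a per-frame binary mask placing a layout box at each trajectory point.

    trajectory: list of (y, x) box centers, one per frame (user-defined path).
    box_hw: (box_height, box_width) of the moving object's layout box.
    """
    masks = np.zeros((num_frames, 1, height, width), dtype=np.float32)
    bh, bw = box_hw
    for t, (cy, cx) in enumerate(trajectory):
        y0, y1 = max(cy - bh // 2, 0), min(cy + bh // 2, height)
        x0, x1 = max(cx - bw // 2, 0), min(cx + bw // 2, width)
        masks[t, 0, y0:y1, x0:x1] = 1.0
    return masks

def blend_latents(background_latents, object_latents, masks):
    """Composite object latents over background latents inside the mask.

    background_latents, object_latents: (T, C, H, W) arrays of noisy latents.
    Keeping the background latents outside the mask is one simple way to
    preserve background consistency during out-of-place animation.
    """
    return masks * object_latents + (1.0 - masks) * background_latents

# Toy usage: a 4-frame clip where the object box slides left to right.
T, C, H, W = 4, 4, 32, 32
trajectory = [(16, 6), (16, 12), (16, 20), (16, 26)]
masks = make_spatiotemporal_mask(T, H, W, trajectory, box_hw=(10, 10))
bg = np.random.randn(T, C, H, W).astype(np.float32)
obj = np.random.randn(T, C, H, W).astype(np.float32)
blended = blend_latents(bg, obj, masks)
print(blended.shape)  # (4, 4, 32, 32)
```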
Primary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: In this work, we present a novel, efficient controllable image-to-video generation model that requires no extra training. It is based on diffusion models and allows a multimedia system to generate content that follows user-provided instructions, enabling a more interactive framework and a better user experience.
Supplementary Material: zip
Submission Number: 3727