Abstract: In recent years, diffusion models have achieved tremendous success in the field of video generation, with controllable video generation receiving significant attention. However, existing control methods still face two limitations. First, control conditions (such as depth maps or 3D meshes) are difficult for ordinary users to obtain directly. Second, it is challenging to simultaneously drive multiple objects through complex motions along multiple trajectories. In this paper, we introduce DragEntity, a video generation model that uses entity representation to control the motion of multiple objects. Compared with previous methods, DragEntity offers two main advantages: 1) Trajectory-based interaction is more user-friendly; users only need to draw trajectories to generate videos. 2) We use entity representation to represent any object in the image, and multiple objects can maintain their relative spatial relationships. Therefore, multiple trajectories can simultaneously control multiple objects in the image at different levels of complexity. Our experiments validate the effectiveness of DragEntity, demonstrating its superior performance in fine-grained control of video generation.
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Generation] Social Aspects of Generative AI, [Generation] Generative Multimedia, [Experience] Multimedia Applications
Relevance To Conference: The primary task of this work is video generation, achieving fine-grained control through the integration of images and trajectories. Our method takes an image and trajectories as input and outputs a video, involving multiple modalities: images, trajectories, and video. This aligns well with the conference's multimedia/multimodal themes. Furthermore, to address the object distortion issue in existing works, we employ interactive segmentation to select the control regions in the initial frame. We also use the mask of each entity in the first frame to extract that entity's central coordinates and predict its motion trajectory with CoTracker. Compared with existing works, our method improves on both FID and FVD metrics. Our work makes a significant contribution to the task of fine-grained control in video generation.
Supplementary Material: zip
Submission Number: 4742