Mask-Guided Video Generation: Enhancing Motion Control and Quality with Limited Data

24 Sept 2024 (modified: 26 Nov 2024) | ICLR 2025 Conference Withdrawn Submission | CC BY 4.0
Keywords: Diffusion models, video generation
TL;DR: Our model uses resources efficiently and achieves controllable, consistent video generation through mask sequences that are either drawn by hand or extracted from existing videos.
Abstract: Recent advances in diffusion models have brought new vitality to visual content creation. However, current text-to-video generation models still face challenges such as high training costs, substantial data requirements, and difficulty maintaining consistency between the given text and the motion of the foreground object. To address these challenges, we propose mask-guided video generation, which requires only a small amount of data and can be trained on a single GPU. Furthermore, to mitigate background interference in controllable text-to-video generation, we use mask sequences, obtained by drawing or extraction, together with the first-frame content to guide video generation. Specifically, our model introduces foreground masks into existing architectures to learn region-specific attention, precisely matching text features to the motion of the foreground object. The mask sequences then guide generation to prevent foreground objects from suddenly disappearing. Our model also incorporates a first-frame sharing strategy during inference, leading to more stable video generation. Additionally, our approach allows for incremental generation of longer video sequences. With this method, our model uses resources efficiently and ensures controllable, consistent video generation via mask sequences. Extensive qualitative and quantitative experiments demonstrate that our approach excels in various video generation tasks, such as video editing and artistic video generation, outperforming previous methods in consistency and quality.
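
To make the region-specific attention described in the abstract concrete, below is a minimal PyTorch sketch of how a foreground mask could modulate cross-attention between latent features and text tokens. This is an illustration under assumptions, not the authors' implementation: the module name, tensor shapes, and the mask-blending rule are all hypothetical.

```python
import torch
import torch.nn as nn

class MaskGuidedCrossAttention(nn.Module):
    """Single-head cross-attention between latent features and text tokens,
    blended with the input via a foreground mask so that text conditioning
    acts only on the masked (foreground) region. All names are illustrative;
    the paper's actual architecture is not public."""

    def __init__(self, dim: int, text_dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(text_dim, dim)
        self.to_v = nn.Linear(text_dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, text_emb: torch.Tensor,
                fg_mask: torch.Tensor) -> torch.Tensor:
        # x:        (B, N, C) flattened latent features, N = H * W
        # text_emb: (B, L, D) text token embeddings
        # fg_mask:  (B, N)    foreground mask in [0, 1], resized to the
        #                     latent resolution
        q = self.to_q(x)
        k = self.to_k(text_emb)
        v = self.to_v(text_emb)
        attn = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)
        out = attn @ v                    # (B, N, C) text-conditioned features
        m = fg_mask.unsqueeze(-1)         # (B, N, 1)
        # Foreground positions take the text-conditioned features, so text/
        # motion alignment is learned only inside the mask; background
        # positions pass through unchanged, limiting background interference.
        return m * out + (1.0 - m) * x
```

Likewise, the first-frame sharing strategy and the incremental extension to longer videos could look like the following sketch. Here `denoise_clip` stands in for an assumed per-clip sampler; its interface is hypothetical, since the paper does not specify one.

```python
import torch

@torch.no_grad()
def generate_long_video(denoise_clip, shared_first_frame, mask_sequences,
                        frames_per_clip=16):
    """Hypothetical sketch: denoise_clip is an assumed sampler that, given a
    first-frame latent and a per-frame mask sequence, returns one clip of
    frames as a tensor of shape (frames_per_clip, C, H, W)."""
    clips = []
    for masks in mask_sequences:  # one mask sequence per clip
        # First-frame sharing: every clip is conditioned on the same
        # first-frame latent, anchoring appearance across clips.
        clip = denoise_clip(first_frame=shared_first_frame,
                            masks=masks, num_frames=frames_per_clip)
        clips.append(clip)
    # Concatenating clips along the time axis yields a longer video that
    # was generated incrementally, one mask sequence at a time.
    return torch.cat(clips, dim=0)
```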
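One design note on the blending rule in the attention sketch: gating the output with the mask, rather than masking the attention logits, keeps the background path an identity mapping, which is one plausible way to realize the paper's goal of preventing background interference while still training end to end.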
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3540