MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers

Haoyu Ma; Shahin Mahdizadehaghdam; Bichen Wu; Zhipeng Fan; Yuchao Gu; Wenliang Zhao; Lior Shapira; Xiaohui Xie

23 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission

Keywords: video editing, masked generative transformers, frame interpolation, diffusion models

TL;DR: We propose to disentangle text-based video editing into a two-stage pipeline: joint editing of key frames with an existing image diffusion model, followed by structure-aware frame interpolation with masked generative transformers.

Abstract: Recent advances in generative AI have significantly enhanced image and video editing, particularly in the context of text prompt control. State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks. However, the computational demands of diffusion-based methods are substantial, and they often require large-scale paired datasets for training, making them challenging to employ in practical applications. This study addresses this challenge by breaking down the text-based video editing process into two stages. In the first stage, we leverage an existing text-to-image diffusion model to simultaneously edit a select few key frames without any additional fine-tuning. In the second stage, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers. MaskINT specializes in frame interpolation between the key frames, benefiting from structural guidance provided by intermediate frames. MaskINT is trained with masked token modeling. Our comprehensive set of experiments illustrates the efficacy and efficiency of MaskINT when compared to other diffusion-based methodologies. This research offers a practical solution for text-based video editing and showcases the potential of non-autoregressive masked generative transformers in this domain.

Supplementary Material: zip

Primary Area: generative models

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 8292
