Keywords: video inpainting, subtitle removal
TL;DR: end-to-end mask-free video subtitle removal
Abstract: Recent breakthroughs in video diffusion models have significantly accelerated the development of video editing techniques. However, existing methods typically inpaint video frames from masked input, which requires extracting the target mask in advance, so the precision of the segmentation directly affects the quality of the completion. In this paper, we present SEDiT, a novel one-stage video $\textbf{S}$ubtitle $\textbf{E}$rasure approach via $\textbf{Di}$ffusion $\textbf{T}$ransformer. We introduce a mask-free inference approach that erases the targeted subtitles directly. The proposed one-stage framework mitigates the suboptimality inherent in the two-stage processing of prior models. To address the challenge of long-term temporal consistency, we adopt a hybrid training strategy that occasionally conditions the model on a clean first-frame latent. This facilitates temporal continuity by allowing each segment during inference to leverage the output of its predecessor. To avoid the visible seams caused by cropping and reinserting processed regions, particularly in scenes with substantial motion, we feed the original video directly into SEDiT. Thanks to the highly compressed Variational Autoencoder (VAE) in the base model and chunk-wise streaming inference, our method can efficiently handle native 1080p video of arbitrary length.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 929
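
The chunk-wise streaming inference mentioned in the abstract can be pictured with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function `stream_erase`, the callback `erase_chunk`, the parameter `chunk_len`, and the choice to overlap consecutive chunks by exactly one frame so that the last cleaned frame of one chunk conditions the next. The sketch operates on generic frame objects rather than VAE latents.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")

# Hypothetical signature: takes a chunk of frames and an optional clean
# first frame to condition on, and returns one cleaned frame per input frame.
EraseFn = Callable[[Sequence[Frame], Optional[Frame]], List[Frame]]


def stream_erase(
    frames: Sequence[Frame],
    erase_chunk: EraseFn,
    chunk_len: int,
) -> List[Frame]:
    """Process a long video chunk by chunk (illustrative sketch only).

    Assumption: consecutive chunks overlap by one frame, and the cleaned
    version of that shared frame is passed to the next chunk as its
    first-frame condition.
    """
    if chunk_len < 2:
        raise ValueError("chunk_len must be at least 2")

    output: List[Frame] = []
    clean_first: Optional[Frame] = None  # no condition for the first chunk
    start = 0
    while start < len(frames):
        chunk = frames[start:start + chunk_len]
        cleaned = erase_chunk(chunk, clean_first)
        if start == 0:
            output.extend(cleaned)
        else:
            # The first frame of this chunk was already emitted by the
            # previous chunk, so skip it to avoid a duplicate.
            output.extend(cleaned[1:])
        clean_first = cleaned[-1]
        if start + chunk_len >= len(frames):
            break
        start += chunk_len - 1  # step back one frame to create the overlap
    return output
```

Because only one chunk and one conditioning frame are held at a time, memory use in this sketch depends on `chunk_len` rather than on the total video length.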