SEDiT: Mask-Free Video Subtitle Erasure with Prompt Instruction

Zheng Hui; Yunlong Bai

Submitted to ICLR 2026 on 02 Sept 2025 (modified: 11 Feb 2026). Readers: Everyone. License: CC BY 4.0

Keywords: video inpainting, subtitle removal

TL;DR: end-to-end mask-free video subtitle removal

Abstract: Recent breakthroughs in video diffusion models have significantly accelerated the development of video editing techniques. However, existing methods often rely on inpainting video frames from masked input, which requires extracting the mask of the target region in advance, so the precision of that segmentation directly affects the quality of the completion. In this paper, we present SEDiT, a novel one-stage video $\textbf{S}$ubtitle $\textbf{E}$rasure approach via $\textbf{Di}$ffusion $\textbf{T}$ransformer. We introduce a mask-free inference approach that enables direct erasure of the targeted subtitles. The proposed one-stage framework mitigates the suboptimality inherent in the two-stage processing of prior models. To address the challenge of long-term temporal consistency, we adopt a hybrid training strategy that occasionally conditions the model on a clean first-frame latent. This facilitates temporal continuity, allowing each segment during inference to leverage the output of its predecessor. To avoid the visible seams caused by cropping and reinserting processed targets, particularly in scenarios involving substantial motion, we feed the original video directly into SEDiT. Thanks to the highly compressed Variational Autoencoder (VAE) in the base model and chunk-wise streaming inference, our method can efficiently handle native 1080p video of arbitrary length.
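
The chunk-wise streaming inference described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `stream_erase` and `toy_erase`, the one-frame overlap between consecutive segments, and the `erase_chunk(chunk, clean_first)` interface are all assumptions made for illustration. The sketch only shows how each segment could reuse the last cleaned frame of its predecessor as its clean first frame.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")

# Hypothetical interface: takes a chunk of frames plus an optional clean
# first frame, and returns one processed frame per input frame.
EraseFn = Callable[[List[Frame], Optional[Frame]], List[Frame]]


def stream_erase(frames: Sequence[Frame], chunk_len: int,
                 erase_chunk: EraseFn) -> List[Frame]:
    """Process a long video in fixed-size chunks (illustrative sketch).

    Assumption: consecutive chunks overlap by exactly one frame, and the
    cleaned version of that frame conditions the next chunk.
    """
    if chunk_len < 2:
        raise ValueError("chunk_len must be at least 2 to allow a 1-frame overlap")
    if not frames:
        return []

    output: List[Frame] = []
    clean_first: Optional[Frame] = None
    start = 0
    while True:
        chunk = list(frames[start:start + chunk_len])
        result = erase_chunk(chunk, clean_first)
        # For every chunk after the first, position 0 is the overlap frame,
        # which was already emitted by the previous chunk.
        output.extend(result if clean_first is None else result[1:])
        if start + chunk_len >= len(frames):
            break
        start += chunk_len - 1      # step back one frame to create the overlap
        clean_first = output[-1]    # predecessor's last cleaned frame
    return output


def toy_erase(chunk: List[int], clean_first: Optional[int]) -> List[int]:
    """Stand-in for the model: 'cleans' a frame by multiplying it by 10."""
    out = [f * 10 for f in chunk]
    if clean_first is not None:
        out[0] = clean_first        # the conditioned frame is kept as given
    return out


if __name__ == "__main__":
    cleaned = stream_erase(list(range(10)), chunk_len=4, erase_chunk=toy_erase)
    print(cleaned)
```

Because only one chunk is held in memory at a time, the loop's memory use depends on `chunk_len` rather than on the total number of frames, which is the property that makes streaming over long videos practical in this sketch.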

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 929
