RT-Remover: Real-Time Video Object Removal by Composing Tracking and Removal in Auto-Regressive Diffusion Transformers
Keywords: Streaming Video Object Removal
Abstract: With the rapid advancement of video diffusion, video editing techniques, especially video object removal, have garnered increasing attention. Existing methods generally rely on separate object tracking and inpainting stages, leading to complex and slow pipelines that are unsuitable for real-time and interactive applications. This paper designs a real-time, minimum-latency video object remover, termed **RT-Remover**. To this end, we introduce three key innovations that enable real-time object removal in videos. First, unlike previous methods that perform tracking and inpainting separately, we compose them into a single joint process: the user provides only an initial mask for the first frame, and the model automatically removes the target objects across the whole video. Second, we build the remover on an auto-regressive diffusion model, which predicts the next chunk conditioned on previous chunks in an auto-regressive manner while iteratively denoising the current chunk with diffusion. We further incorporate a fixed-length key-value cache to minimize both memory usage and computational overhead. Third, to further speed up inference, we distill the auto-regressive diffusion model with distribution matching distillation and a flow matching loss, reducing the number of sampling steps from 25 to 2 while preserving background consistency. Together, these contributions significantly simplify the pipeline and enable real-time performance: our method achieves ***33 FPS*** and ***0.12s latency*** on a 5090 GPU with our trained faster VAE. Extensive experiments show that our approach achieves the lowest latency among existing methods while maintaining competitive visual quality.
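To make the chunk-wise auto-regressive generation concrete, the sketch below illustrates the general pattern the abstract describes: each chunk is denoised with a few diffusion steps while attending to a bounded window of previously generated chunks through a fixed-length key-value cache. This is a minimal illustration, not the authors' implementation; all names (`denoise_step`, `CHUNK_LEN`, `MAX_CACHE_CHUNKS`, the mask handling) are hypothetical placeholders.

```python
# Hypothetical sketch of chunk-wise auto-regressive denoising with a
# fixed-length KV cache, as described in the abstract. Not the paper's code.
import torch

CHUNK_LEN = 4          # frames per chunk (assumed)
MAX_CACHE_CHUNKS = 4   # fixed-length cache: only the most recent chunks are kept
NUM_STEPS = 2          # sampling steps after distillation (25 -> 2 per the abstract)

@torch.no_grad()
def remove_objects_stream(model, frames, first_frame_mask):
    """Remove the masked object chunk by chunk, conditioning each chunk on a
    bounded window of previously generated chunks via the KV cache."""
    kv_cache = []                      # per-chunk key/value tensors from past chunks
    outputs = []
    for start in range(0, frames.shape[0], CHUNK_LEN):
        chunk = frames[start:start + CHUNK_LEN]
        x = torch.randn_like(chunk)    # initialize the current chunk from noise
        for step in range(NUM_STEPS):  # few-step diffusion on the current chunk
            x, new_kv = model.denoise_step(
                x, chunk, first_frame_mask, step, past_kv=kv_cache
            )
        kv_cache.append(new_kv)
        if len(kv_cache) > MAX_CACHE_CHUNKS:   # keep memory and compute bounded
            kv_cache.pop(0)
        outputs.append(x)
    return torch.cat(outputs, dim=0)
```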
Primary Area: generative models
Submission Number: 4866