Abstract: Video inpainting, crucial for the media industry, aims to restore corrupted content. However, current methods relying on limited pixel propagation or single-branch image inpainting architectures face challenges in generating fully masked objects, balancing background preservation with foreground generation, and maintaining ID consistency over long videos. To address these issues, we propose VideoPainter, an efficient dual-branch framework featuring a lightweight context encoder. This plug-and-play encoder processes masked videos and injects background guidance into any pre-trained video diffusion transformer, generalizing across arbitrary mask types, enhancing background integration and foreground generation, and enabling user-customized control. We further introduce an inpainting-region resampling strategy that maintains ID consistency in any-length video inpainting. Additionally, we develop a scalable dataset pipeline using advanced vision models and construct VPData and VPBench, the largest video inpainting dataset and benchmark with segmentation masks and dense captions (>390K clips), to support large-scale training and evaluation. We also show VideoPainter's promising potential in downstream applications such as video editing. Extensive experiments demonstrate VideoPainter's state-of-the-art performance in any-length video inpainting and editing across $8$ key metrics, including video quality, masked region preservation, and textual coherence.
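To make the dual-branch idea concrete, below is a minimal, hypothetical PyTorch-style sketch of the mechanism the abstract describes: a lightweight context encoder consumes masked (background) video tokens and adds zero-initialized residual guidance to the early blocks of a frozen video diffusion transformer. All module names (`embed`, `blocks`, `head`), dimensions, and the injection depth are illustrative assumptions, not VideoPainter's actual implementation.

```python
import torch
import torch.nn as nn


class ContextEncoder(nn.Module):
    """Hypothetical lightweight context encoder: encodes masked (background) video
    tokens and produces per-layer residual guidance for a frozen video DiT."""

    def __init__(self, latent_dim: int = 16, hidden_dim: int = 1024, num_inject_layers: int = 2):
        super().__init__()
        # Input tokens are assumed to concatenate masked video latents with mask channels.
        self.proj_in = nn.Linear(latent_dim * 2, hidden_dim)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
            for _ in range(num_inject_layers)
        )
        # Zero-initialized output projections so training starts from an identity mapping
        # and does not perturb the frozen backbone at initialization.
        self.proj_out = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim) for _ in range(num_inject_layers)
        )
        for proj in self.proj_out:
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)

    def forward(self, masked_tokens: torch.Tensor) -> list[torch.Tensor]:
        h = self.proj_in(masked_tokens)
        guidance = []
        for block, proj in zip(self.blocks, self.proj_out):
            h = block(h)
            guidance.append(proj(h))  # residual background guidance per injected layer
        return guidance


def denoise_with_context(frozen_dit, context_encoder, noisy_tokens, masked_tokens, t, text_emb):
    """Sketch of one denoising step: the encoder's features are added to the hidden
    states of the first few DiT blocks (illustrative API, not the official one)."""
    guidance = context_encoder(masked_tokens)
    h = frozen_dit.embed(noisy_tokens, t, text_emb)  # assumed patchify/embedding helper
    for i, block in enumerate(frozen_dit.blocks):
        if i < len(guidance):
            h = h + guidance[i]  # inject background guidance into early layers
        h = block(h, t, text_emb)
    return frozen_dit.head(h)  # predicted noise / velocity
```

Because only the small encoder and its zero-initialized projections are trained, the backbone stays frozen, which is what makes the encoder plug-and-play across different pre-trained video diffusion transformers.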