Keywords: video inpainting; Large Language Model
Abstract: Video inpainting is a fundamental task with wide applications in film post-production and object removal. Existing text-guided image and video editing methods typically rely on implicit conditioning, injecting text embeddings directly into the generation process; this lacks explicit intermediate representations and makes it difficult to precisely align the semantic space with the pixel space. To address this limitation, we propose an LLM-guided video inpainting framework that leverages a multimodal large language model (MLLM) to generate explicit masks, followed by a mask smoothing and enhancement module for post-processing and a video inpainting backbone for final restoration. Furthermore, we propose a Warp-Relation Consistency Mechanism that explicitly enforces temporal alignment between frames via flow-guided warping and relation-aware constraints. Extensive experiments demonstrate that our approach not only achieves state-of-the-art PSNR and SSIM but also reduces mask boundary artifacts and improves temporal consistency compared to existing methods. We will publicly release the code and pretrained models to facilitate reproducible research.
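To make the flow-guided warping idea in the abstract concrete, here is a minimal NumPy sketch of a warp-consistency penalty: the previous frame is backward-warped to the current frame using an optical flow field, and an L1 distance between the warped result and the current frame serves as the temporal-consistency term. All function names are illustrative assumptions; the paper's actual mechanism presumably operates on learned features with differentiable bilinear sampling, and also includes relation-aware constraints not shown here.

```python
import numpy as np

def warp_with_flow(frame, flow):
    """Backward-warp `frame` (H, W) using optical `flow` (H, W, 2).

    flow[y, x] = (dx, dy) says pixel (y, x) in the current frame
    corresponds to pixel (y + dy, x + dx) in `frame`. Nearest-neighbor
    sampling with border clamping, for simplicity.
    """
    H, W = frame.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return frame[src_y, src_x]

def warp_consistency_loss(frame_t, frame_prev, flow):
    """Mean L1 penalty between the current frame and the warped previous frame."""
    warped = warp_with_flow(frame_prev, flow)
    return float(np.abs(frame_t - warped).mean())

# Illustrative check: identical frames under zero flow incur zero penalty.
f = np.arange(16.0).reshape(4, 4)
zero_flow = np.zeros((4, 4, 2))
print(warp_consistency_loss(f, f, zero_flow))  # 0.0
```

In a training loop, this term would be summed over consecutive frame pairs and added to the reconstruction objective, so that inpainted content is penalized for drifting between frames along the estimated flow.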
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16932