Keywords: video inpainting; Large Language Model
Abstract: Video inpainting is a fundamental task with wide applications in film post-production and object removal. Existing text-guided image and video editing methods typically rely on implicit conditioning, injecting text embeddings directly into the generation process; this lacks explicit intermediate representations and makes it difficult to precisely align the semantic space with the pixel space. To address this limitation, we propose an LLM-guided video inpainting framework that leverages a multimodal large language model (MLLM) to generate explicit masks, followed by a mask smoothing and enhancement module for post-processing and a video inpainting backbone for final restoration. Furthermore, we propose a Warp-Relation Consistency Mechanism that explicitly enforces temporal alignment between frames via flow-guided warping and relation-aware constraints. Extensive experiments demonstrate that our approach not only achieves state-of-the-art PSNR and SSIM but also reduces mask boundary artifacts and improves temporal consistency compared to existing methods. We will publicly release the code and pretrained models to facilitate reproducible research.
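To make the flow-guided warping idea in the abstract concrete, here is a minimal NumPy sketch of a warp-consistency penalty: the previous frame is backward-warped to the current frame using an optical flow field, and an L1 distance between the warped result and the current frame serves as the temporal-consistency term. All function names are illustrative assumptions; the paper's actual mechanism presumably operates on learned features with differentiable bilinear sampling, and also includes relation-aware constraints not shown here.

```python
import numpy as np

def warp_with_flow(frame, flow):
    """Backward-warp `frame` (H, W) using optical `flow` (H, W, 2).

    flow[y, x] = (dx, dy) says pixel (y, x) in the current frame
    corresponds to pixel (y + dy, x + dx) in `frame`. Nearest-neighbor
    sampling with border clamping, for simplicity.
    """
    H, W = frame.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return frame[src_y, src_x]

def warp_consistency_loss(frame_t, frame_prev, flow):
    """Mean L1 penalty between the current frame and the warped previous frame."""
    warped = warp_with_flow(frame_prev, flow)
    return float(np.abs(frame_t - warped).mean())

# Illustrative check: identical frames under zero flow incur zero penalty.
f = np.arange(16.0).reshape(4, 4)
zero_flow = np.zeros((4, 4, 2))
print(warp_consistency_loss(f, f, zero_flow))  # 0.0
```

In a training loop, this term would be summed over consecutive frame pairs and added to the reconstruction objective, so that inpainted content is penalized for drifting between frames along the estimated flow.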
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16932