Keywords: Multimodal Generation, Image Editing, Video Editing, Diffusion Model
Abstract: With recent advances in Multimodal Large Language Models (MLLMs) demonstrating strong visual understanding and reasoning, interest is growing in using them to improve the editing performance of diffusion models. Despite rapid progress, most studies lack an in-depth analysis of MLLM design choices. Moreover, the integration of MLLMs and diffusion models remains an open challenge for some difficult tasks, \textit{e.g.}, video editing. In this paper, we present InstructX, a unified framework for image and video editing. Specifically, we conduct a comprehensive study on integrating MLLMs and diffusion models for instruction-driven editing across diverse tasks. Building on this study, we analyze the cooperation and distinction between images and videos in unified modeling. \textit{(1)} We show that training on image data can give rise to video editing capabilities without explicit supervision, thereby alleviating the constraints imposed by scarce video training data. \textit{(2)} By incorporating modality-specific MLLM features, our approach effectively unifies image and video editing tasks within a single model. Extensive experiments demonstrate that our method can handle a broad range of image and video editing tasks and achieves state-of-the-art performance.
Primary Area: generative models
Submission Number: 7173