Abstract: This paper introduces a novel dataset construction pipeline
that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the
identity of subjects and scenes, ensuring that content is consistently preserved during editing. Additionally, video data
captures diverse, natural dynamics—such as non-rigid subject motion and complex camera movements—that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create
a new dataset to train InstructMove, a model capable of
complex, instruction-based manipulations that are difficult
to achieve with synthetically generated datasets. Our model
demonstrates state-of-the-art performance in tasks such as
adjusting subject poses, rearranging elements, and altering
camera perspectives. The project page is available here.
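As a concrete illustration of the sampling step described above, the sketch below assembles one (source image, instruction, target image) triplet from a video. It is not the paper's actual implementation: OpenCV is assumed for frame decoding, the gap sizes are illustrative, and `mllm_caption_fn` is a hypothetical placeholder for whatever MLLM endpoint generates the editing instruction from the two frames.

```python
import random

import cv2  # pip install opencv-python


def sample_frame_pair(video_path: str, min_gap: int = 15, max_gap: int = 60):
    """Sample a (source, target) frame pair separated by a random temporal gap.

    A moderate gap keeps subject and scene identity intact while still
    capturing natural motion (pose changes, camera movement) between frames.
    The gap bounds here are illustrative assumptions, not the paper's values.
    """
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    gap = random.randint(min_gap, max_gap)
    start = random.randint(0, max(0, total - gap - 1))

    # Seek to and decode the two frames.
    cap.set(cv2.CAP_PROP_POS_FRAMES, start)
    ok_src, src = cap.read()
    cap.set(cv2.CAP_PROP_POS_FRAMES, start + gap)
    ok_tgt, tgt = cap.read()
    cap.release()

    if not (ok_src and ok_tgt):
        raise RuntimeError(f"Failed to decode frames from {video_path}")
    return src, tgt


def make_training_example(video_path: str, mllm_caption_fn):
    """Build one training triplet for instruction-based editing.

    `mllm_caption_fn` is a placeholder for an MLLM call that receives the
    two frames and returns an editing instruction describing the change,
    e.g. "Turn the dog's head to face the camera."
    """
    src, tgt = sample_frame_pair(video_path)
    instruction = mllm_caption_fn(src, tgt)
    return {"source": src, "instruction": instruction, "target": tgt}
```

Under these assumptions, an editing model would then be trained to map the source frame plus the generated instruction to the target frame, which is what lets naturally occurring dynamics serve as supervision.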