Keywords: Human Image-to-Video Generation, Multiview Synthesis
Abstract: Recent advances in image-to-video (I2V) generation have enabled the synthesis of high-quality videos from a single input image. However, single-image conditioning imposes a significant limitation: models struggle to maintain appearance consistency for unseen regions, such as the back of a garment. Lacking complete information, they are forced to hallucinate these missing views, which frequently introduces visual artifacts and inconsistencies. This issue hinders adoption in applications such as e-commerce, where visual fidelity to the actual garment is critical.
In this paper, we propose the Multiview Enhanced Image-to-Video Generation Model (MVI2V), which addresses this issue by introducing additional multi-view images of the person or garment as extra references to guide the generation process.
Specifically, MVI2V introduces an additional, structurally identical forward stream dedicated to processing the reference images. This transforms the original single- or dual-stream architecture into a dual- or triple-stream one, respectively.
Cross-stream fusion is achieved through self-attention, which enables bidirectional information flow among tokens of the different streams.
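To make the fusion mechanism concrete, here is a minimal PyTorch sketch of joint self-attention over concatenated streams. All names (`CrossStreamAttention`, the three token arguments) are hypothetical illustrations of the abstract's description, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class CrossStreamAttention(nn.Module):
    """Hypothetical sketch: joint self-attention over concatenated
    video, conditioning-image, and reference-image token streams,
    so information flows bidirectionally between all streams."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens, cond_tokens, ref_tokens):
        # Concatenate all streams along the sequence axis; full
        # self-attention over the joint sequence lets every token
        # attend to tokens of every other stream.
        tokens = torch.cat([video_tokens, cond_tokens, ref_tokens], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)
        # Split back into per-stream tokens for the stream-specific layers.
        n_v, n_c = video_tokens.shape[1], cond_tokens.shape[1]
        return fused[:, :n_v], fused[:, n_v:n_v + n_c], fused[:, n_v + n_c:]
```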
Regarding the training strategy, we incorporate an inpainting sub-task that randomly masks the person region in the conditioning image, thereby compelling the model to rely more heavily on guidance from the reference images.
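A minimal sketch of this masking step is shown below, assuming a precomputed binary person mask and a simple per-sample masking probability `p`; the function name and the exact masking policy are illustrative only:

```python
import torch

def mask_person_region(cond_image: torch.Tensor,
                       person_mask: torch.Tensor,
                       p: float = 0.5) -> torch.Tensor:
    """Hypothetical sketch of the inpainting sub-task: with probability p,
    zero out the person region of the conditioning image so the model
    must recover the appearance from the multi-view references instead.

    cond_image:  (B, C, H, W) conditioning frame
    person_mask: (B, 1, H, W) binary mask, 1 = person pixels
    """
    if torch.rand(()) < p:
        cond_image = cond_image * (1.0 - person_mask)
    return cond_image
```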
To facilitate efficient model learning, we design a data curation pipeline that selects videos in which the subject exhibits large-angle viewpoint variation, and then systematically extracts a comprehensive set of multi-view reference images from each video.
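The following sketch illustrates one plausible form of this pipeline under stated assumptions: `estimate_yaw` stands in for whatever per-frame pose/angle estimator the paper actually uses, and the span threshold and evenly spaced viewpoint sampling are assumptions for illustration:

```python
def curate_video(frames, estimate_yaw, min_span_deg=120.0, n_refs=4):
    """Hypothetical sketch of the curation pipeline: keep a video only if
    the subject's estimated yaw angle spans a large range, then extract
    reference frames at roughly evenly spaced viewpoints."""
    yaws = [estimate_yaw(f) for f in frames]
    span = max(yaws) - min(yaws)
    if span < min_span_deg:
        return None  # reject: viewpoint variation too small

    # Pick the frame closest to each of n_refs evenly spaced target angles.
    targets = [min(yaws) + i * span / (n_refs - 1) for i in range(n_refs)]
    refs = [min(range(len(frames)), key=lambda j: abs(yaws[j] - t))
            for t in targets]
    return [frames[i] for i in sorted(set(refs))]
```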
Extensive experiments on both Wan2.1 I2V and our in-house I2V model show that MVI2V can faithfully reference multi-view images of a person or garment while preserving the foundational I2V generation capabilities of the base model. These results validate the effectiveness of the proposed network architecture, training strategy, and data curation pipeline. Code will be released to facilitate future research.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3111