Abstract: Video Virtual Try-On aims to transfer a garment onto a person in a video. Previous methods typically focus on image-based virtual try-on, and directly applying them to videos often leads to temporal discontinuity caused by inconsistencies between frames. The few existing attempts at video virtual try-on also suffer from unrealistic results and poor generalization. In light of previous research, we posit that the task of video virtual try-on can be decomposed into two key aspects: (1) single-frame results should be realistic and natural while remaining consistent with the garment; (2) the person's motion and the garment should be coherent throughout the entire video. To address these two aspects, we propose a novel two-stage framework based on the Latent Diffusion Model, namely Garment-Preserving Diffusion for Video Virtual Try-On (GPD-VVTO). In the first stage, the model is trained on single-frame data to improve its ability to generate high-quality try-on images. We integrate both low-level texture features and high-level semantic features of the garment into the denoising network, preserving garment details while ensuring a natural fit between the garment and the person. In the second stage, the model is trained on video data to enhance temporal consistency. We devise a novel Garment-aware Temporal Attention (GTA) module that incorporates garment features into temporal attention, enabling the model to maintain fidelity to the garment during temporal modeling. Furthermore, we collect a video virtual try-on dataset containing high-resolution videos from diverse scenes, addressing the limited variety of current datasets in terms of video backgrounds and human actions. Extensive experiments demonstrate that our method outperforms existing state-of-the-art methods on both image-based and video-based virtual try-on tasks, indicating the effectiveness of the proposed framework.
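To make the GTA idea concrete, below is a minimal PyTorch sketch of how garment features might be incorporated into temporal attention. The class name, tensor shapes, and the specific design of appending projected garment tokens as extra keys/values are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a Garment-aware Temporal Attention (GTA) block.
# Assumption (not from the paper): garment features are flattened into
# tokens and appended to each spatial location's frame history as extra
# keys/values, so temporal attention can also attend to the garment.
import torch
import torch.nn as nn

class GarmentAwareTemporalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Map garment tokens into the latent feature space (hypothetical choice).
        self.proj_garment = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, garment: torch.Tensor) -> torch.Tensor:
        # x:       (B, T, H*W, C) video latents; attention runs over T per location
        # garment: (B, G, C) garment tokens, e.g. flattened garment-encoder features
        b, t, hw, c = x.shape
        q = self.norm(x).permute(0, 2, 1, 3).reshape(b * hw, t, c)   # (B*HW, T, C)
        g = self.proj_garment(garment)                               # (B, G, C)
        g = g.unsqueeze(1).expand(b, hw, -1, -1).reshape(b * hw, -1, c)
        kv = torch.cat([q, g], dim=1)        # frame history + garment tokens
        out, _ = self.attn(q, kv, kv)        # garment-aware temporal attention
        out = out.reshape(b, hw, t, c).permute(0, 2, 1, 3)
        return x + out                       # residual connection
```

Under these assumptions, each spatial location attends over its own temporal trajectory plus the garment tokens, so the module can enforce frame-to-frame coherence without drifting away from the garment's appearance.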
Primary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: With the rise of e-commerce, video virtual try-on has garnered widespread attention in industry, offering consumers an immersive and interactive online shopping experience. This work advances multimedia processing by addressing the challenges of video virtual try-on. We introduce an LDM-based two-stage framework, GPD-VVTO, that produces realistic and temporally coherent garment transfer in videos. By integrating garment features into the denoising network, the method preserves garment details and maintains temporal consistency throughout the video. Additionally, to address the lack of diversity in existing datasets, we collect a video virtual try-on dataset from e-commerce platforms containing high-resolution videos that depict diverse scenes and human actions, making it better suited to real-world applications. The proposed method outperforms state-of-the-art approaches on both image-based and video-based virtual try-on tasks.
Supplementary Material: zip
Submission Number: 1565