Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos

Zhengze Xu; Mengting Chen; Zhao Wang; Linyu XING; Zhonghua Zhai; Nong Sang; Jinsong Lan; Shuai Xiao; Changxin Gao

Published: 20 Jul 2024, Last Modified: 05 Aug 2024, MM2024 Poster, CC BY 4.0

Abstract: Video try-on is challenging and has not been well tackled in previous works. The main obstacle lies in preserving the clothing details and modeling the coherent motions simultaneously. Faced with those difficulties, we address video try-on by proposing a diffusion-based framework named "Tunnel Try-on." The core idea is excavating a "focus tunnel" in the input video that gives close-up shots around the clothing regions. We zoom in on the region in the tunnel to better preserve the fine details of the clothing. To generate coherent motions, we leverage the Kalman filter to smooth the tunnel and inject its position embedding into attention layers to improve the continuity of the generated videos. In addition, we develop an environment encoder to extract the context information outside the tunnels. Equipped with these techniques, Tunnel Try-on keeps fine clothing details and synthesizes stable and smooth videos. Demonstrating significant advancements, Tunnel Try-on could be regarded as the first attempt toward the commercial-level application of virtual try-on in videos. The project page is https://mengtingchen.github.io/tunnel-try-on-page/.
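
As a rough illustration of the tunnel-smoothing step mentioned in the abstract, the sketch below runs a forward Kalman filter over per-frame bounding boxes. The abstract does not specify the state model, the box parameterization, or the noise settings, so everything here is an assumption: a constant-velocity model applied independently to each box coordinate, a hypothetical `(cx, cy, w, h)` layout, and illustrative values for `process_var` and `meas_var`. The function names are hypothetical and are not taken from the paper's code.

```python
import numpy as np


def kalman_smooth_1d(z, process_var=1e-2, meas_var=4.0):
    """Forward Kalman filter (constant-velocity model) over one coordinate track.

    Assumed setup, not the paper's: state = [position, velocity],
    only the position is observed.
    """
    z = np.asarray(z, dtype=float)
    F = np.array([[1.0, 1.0], [0.0, 1.0]])  # state transition per frame
    H = np.array([[1.0, 0.0]])              # observation picks out position
    Q = process_var * np.eye(2)             # process noise covariance
    R = np.array([[meas_var]])              # measurement noise covariance

    x = np.array([z[0], 0.0])               # start at first measurement, zero velocity
    P = np.eye(2)
    out = np.empty_like(z)
    out[0] = z[0]

    for t in range(1, len(z)):
        # Predict the next state from the motion model.
        x = F @ x
        P = F @ P @ F.T + Q
        # Correct the prediction with the measured coordinate.
        innovation = z[t] - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ innovation
        P = (np.eye(2) - K @ H) @ P
        out[t] = x[0]
    return out


def smooth_tunnel(boxes, **kwargs):
    """Smooth a (T, 4) array of per-frame boxes, one coordinate at a time."""
    boxes = np.asarray(boxes, dtype=float)
    columns = [kalman_smooth_1d(boxes[:, i], **kwargs) for i in range(boxes.shape[1])]
    return np.stack(columns, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical jittery detections around a fixed box (cx, cy, w, h).
    noisy = np.array([320.0, 240.0, 200.0, 300.0]) + rng.normal(0.0, 2.0, size=(60, 4))
    smoothed = smooth_tunnel(noisy)
    print("mean frame-to-frame jitter before:", np.abs(np.diff(noisy, axis=0)).mean())
    print("mean frame-to-frame jitter after: ", np.abs(np.diff(smoothed, axis=0)).mean())
```

A larger `meas_var` relative to `process_var` makes the filter trust the motion model more and yields a steadier crop, at the cost of lagging behind fast movements.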

Primary Subject Area: [Generation] Generative Multimedia

Secondary Subject Area: [Generation] Generative Multimedia

Relevance To Conference: Our paper aligns well with the sub-theme of "Generative Multimedia" under the broader theme of "Multimedia in the Generative AI Era". Specifically, we have developed a video virtual try-on system based on diffusion models, which takes both clothing images and user videos as input and generates high-fidelity try-on videos. Compared to image-based try-on systems that output single still images, our video virtual try-on model offers users a more immersive and realistic try-on experience. Additionally, as the first diffusion-based video virtual try-on model, our system supports various types of tops and bottoms, as well as complex backgrounds and diverse movements in real-world scenarios. This allows users to input more personalized videos for virtual try-on, thereby enhancing the overall user experience. To conclude, our work contributes to the application of video generation models in the fashion domain, making an impact in the multimedia field.

Supplementary Material: zip

Submission Number: 573
